Sunday, 27 November 2016

Working with huge files in Unix/Linux

The Unix or GNU/Linux command line has a variety of tools that are essential when working with gigantic files. Bioinformaticians often work with files that are many gigabytes in size. Recently, I was working with very large VCF and FASTA files and wanted to check their contents. Both files came gzipped so I needed to decompress them first.

~$ gzip -d file.vcf.gz
~$ gzip -d seq.fasta.gz

Now, what I had at the end of this was an 11GB VCF file and a 250MB fasta file - way too big to open up in any text editor. But I can use a variety of utilities available on the Linux command line to work with it.

To view the first few lines I could use head like so:

~$ head file.vcf

By default, this did not display enough lines to satisfy my requirements. I needed to view the first 100 lines so I used the -n paramter like so:

~$ head -n 100 file.vcf

I can also use head to grab the first few characters if I am reading from a FASTA file with a really long sequence on a single line. If I wanted to grab the first 1000 characters I can write:

~$ head -c 1000 seq.fasta

I can also use the same parameter to grab the last few characters or lines from the file using tail like so:

~$ tail -c 1000 seq.fasta

This is all good and well but what if I want to inspect only certain lines from a file - say from lines 10 - 20? I can use awk for that:

~$ awk 'NR==10,NR==20' file.vcf

I can also use awk to skip the first 10 lines like so:

~$ awk 'NR>10' file.vcf 

OK. What if I want to count how many characters are in the second line of a FASTA file where, in some cases, a whole sequence put onto one line? I can use a combination of awk and wc to count the letters:

~$ awk 'NR==2' seq.fasta | wc -c

If you want to count how many lines and how many characters there are in a file you can use wc like this passing it both the l and c flags:

~$ wc -lc seq.fasta

Now, if I want to navigate through the whole VCF file to see what it contained, I could not use a text editor. Instead, I would use 'less'. 'less' is better than 'more' if you are more familiar with that. It allows you to navigate to certain line numbers and you can use PgUp and PgDn and the up and down arrows to navigate through a file. You press Q to quit back to your console when you are done.

~$ less file.vcf

Less is OK for some things but it takes quite some time to read a really really big file. It may sometimes be easier to make a smaller sample file from the big file so you can comfortably inspect it. Use the redirection operator '>' to output say 1000 lines to a new file called smaller.vcf:

~$ head -n 1000 file.vcf > smaller.vcf

OK. There's something I didn't mention before which may save you a lot of time if you are in a rush or really short of disk space - you don't need spend all that time decompressing the VCF or FASTA files! You can decompress them on the fly on the command line and pipe a little of the decompressed output to one of those command line tools mentioned above. For example, you just want to grab 1000 lines from file.vcf.gz which is 11GB when decompressed, but you don't want to wait an hour for it to decompress and you only have 4GB of free disk space to spare. Try this:

~$ gzip -dc file.vcf.gz | head -n 1000 > just1000lines.vcf

The trick is that gzip can decompress things out to the console using the -c flag. So while it's decompressing, the output is piped to head, which continues to collect data until 1000 lines are read and then instead of printing all those lines out to the console, we redirect the output to insert into a file: just1000lines.vcf. This all happens very quickly and efficiently in memory and only 1000 lines are written to disk, saving you time and disk space.

Hopefully these command line tools - gzip, head, tail, awk, wc, less - and the redirection operator '>' will help you when you are working with large files.

Using 1000 genomes data files

The 1000 genomes project was established in 2008, a few years after the human genome project (HGP) was completed in 2001. While the HGP was aimed at producing a single reference genome for Homo sapiens, the goal of the 1000 genomes project was to identify the variation present within the species by sequencing 1000s of individuals using fast and cheaper next generation sequencing (NGS) technology. Both projects put their data on line for researchers to download for free.

FASTA Files

The HGP packages the human genome into FASTA format files with one file for every human chromosome (1..22, X, Y) and an accessory FASTA file for the mitochondrial DNA (mtDNA). The FASTA format is a really simple text file format and its suffix tends to end in .fa, .fas or .fasta. The first line of the file consists of a descriptive header beginning with the greater-than symbol '>' followed by the name and/or description and/or unique identifier of the sequence. The next lines after that is the sequence. Typically, 60 characters are present on a line, followed by a new line character '\n', and this pattern goes on till the end of the sequence. Here is an example of part of the Human Cytochrome C gene:

>chromosome:GRCh38:7:25119491:25125961:-1 CTGTCTCTGCCGCAAATGCAGCACCTTCCTCAGTCTTGGGGCGTCCATCAAGCGGAGAGC TGGAATGTGACTGAAAGTAACAGAGTATAGGTGGAACTAGAACAAGGTCACGAGCTGGTG CCCAACTCGTGTAACTCCTGGTTCCAGGCTGTCCCTCTAGGATGCTGTTGGCGCCAGCCG TGGTAGTAGGACAGAGTCTGGGTAAAGGTCGCCTGCTGACAAGATGGTAGCTAGGAGACC TCTAGGTGCGCCAAAATTAGGGCGTCTTTTCC...(etc)

As you can see, the first line describes where it is obtained from. GRCh38 is the latest version of the HGP reference sequence. The gene can be found on chromosome 7 and starts from position 25119491 and ends at position 25125961. -1 indicates it is encoded on the reverse strand. The lines after are the actual genetic code.

The HGP package each assembled whole chromosome's worth of genetic data into individual FASTA files of between 50 and 250MB in size. You can download the gzip-compressed files by going to the ENSEMBL FTP site: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/

An alternative to downloading each chromosome from the ENSEMBL site is to download an 860MB compressed file containing all the chromosomes in one big FASTA file from the 1000 genomes project. In this blog post I suggest downloading the slightly older GRCh37 version of the HGP, specifically file hs37d5.gz, because it is fully compatible with the VCF files mentioned below. You can obtain the reference genome from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

You don't have to decompress that gigantic FASTA file because I wrote a little utility application to extract specific chromosomes from the large gzip file and output the chromosome in its own FASTA file: https://github.com/webmasterar/extractChromosome


After you download a chromosome file from ENSEMBL you can just decompress it using your favourite decompression utility such as winrar or if you are on Linux you can write the command: gzip -d Homo_sapiens.GRCh38.dna_rm.chromosome.21.fa.gz.

FASTA is a simple and flexible format. There are no restrictions on the number of sequences in a file as long as they each start with a header line followed by the sequence. Such files are typically known as multi-FASTA files, but they don't have a specific file suffix to indicate that. The format allows for comment lines starting with a semi-colon ';'. There are no restrictions on line length either and a whole sequence could be present on a single line. Typically, when reading FASTA files in a program, they are read line by line, stripping line-ending characters and spaces, skipping past empty and comment lines starting with ';', and storing the title and sequence in memory as strings ready to be searched or analysed.

VCF Files

VCF files are also text files but are a more complex format than FASTA. VCF files are used to store variants, typically single nucleotide polymorphisms (SNPs) and consist of a set of mandatory and optional tab-separated columns. They can represent more complex variants consisting of insertions and deletions too. VCF files are used together with FASTA reference chromosome files and would be of little use by themselves. Each row of the VCF file is a record of a SNP/variant and points to a position in the reference chromosome where the variation occurs. So together, it is possible to map all variants from multiple donors onto the reference human genome.

VCF files are often distributed as specially compressed files that might resemble gzip files (since they end with .gz) but are actually compressed using BGZIP. The compression utility BGZIP is part of a package known as Tabix, both of which are now part of the htslib library and packaged with SAMTools. If you ever need to use BGZIP you need only download the Tabix package using your Linux distro's package manager, e.g. sudo apt-get install tabix. Tabix contains a library for reading tab-separated file formats and together with BGZIP it allows applications to read the tab-separated VCF format in its compressed state. This is especially important because VCF files are huge. Chromosome 1's compressed VCF file is 1.2GB in size and when it is decompressed it takes up around 70GB of disk space! Imagine having to decompress all the human chromosomes - a time-consuming and disk-consuming exercise that is ultimately unnecessary due to Tabix and its BGZIP compression algorithm.

The VCF format contains rows separated by lines and each line is separated by tabs for each of the columns. The first few lines in the file are metadata and comments starting with a hash '#' character and contain details about the file and it's corresponding reference sequence, and information about indels and other complex variants as well as details about optional column headers. The mandatory headers are:

#CHROM - the chromosome number
POS - the position that the SNP/variant occurs in the reference sequence
ID - just the ID of the SNP/variant
REF - the base that should be found in the reference sequence at this position
ALT - the alternate base or bases that might be present here*
QUAL - the quality of the read
FILTER - only accept this record if it says 'PASS'
INFO - various details about phasing and such

*The ALT field may contain references to complex variants written in angle brackets like so '<ADD:INS:1>'.

Here is an example record from a VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096
10   60523  rs148087467    T     G       100     PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2; GT:GL 0|0:-0.19,-0.46,-2.28


You can see it is a T to G single nucleotide polymorphism occurring at position 60523 on chromosome 10.

Most VCF files list the variants in ascending order because many software applications that read VCF files expect it to be in order. It is possible to have multiple entries for the same position of a chromosome in VCF files and thus have duplicates. Dealing with this issue will be resolved in a future blog post. Typically, you would not write your own parser to go through the VCF file because it has a complex compression and file format. For a C++ project I used VCFLib.

The 1000 genomes project was performed in three major phases. The first or pilot phase obtained DNA from 179 individuals, the second (Phase 1) from 1092 individuals (including the pilots) and the third phase (Phase 3) from 2504 individuals (including the people from the first two phases). They were sequenced with improving NGS technologies and the data was collected into VCF files for each chromosome. Since this project started in 2008, the reference human genome at the time was GRCh37 and as such the VCF files should use that as it's reference format. This is mentioned in the comments of the phase 3 VCF files. When you download the compressed (.gz) VCF file you will also need to download its accompanying Tabix index file (.tbi) which contains the index that helps Tabix read the file in its compressed state. You DO NOT need to decompress the VCF files. The files for phase 3 can be downloaded from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

In conclusion, to use the 1000 genomes files, you will need to download the reference human genome sequence chromosome by chromosome (or as one big 850MB compressed file), the VCF file for the chromosome and its accompanying TBI file. There are is no need to decompress the files.

Saturday, 6 August 2016

Pattern matching with degenerate characters and don't-cares using bit-vector encoding

The pattern matching problem falls into two practical categories: exact and approximate pattern matching.

In pattern matching we have a text T of length n and a pattern X of length m ≤ n. In exact pattern matching we want to find X in T exactly as it is. In approximate pattern matching we want to find X in T but it doesn't need to match exactly. We define a threshold value k which is the maximum number of optimal changes that can be made to X to make it match a part of T. If more than k changes are required to make X match a part of T then we conclude that it is not in fact a factor of T and move on to look at the next position.

Exact pattern matching is easily solved using the naive method in worst-case O(nm) time. Of course, better algorithms exist to reduce this complexity. The KMP algorithm is famous but it is not the best one to use. Other algorithms include Shift-And/Shift-Or, Horspool, BNDM and BOM, of which the last two are the most effective. They achieve their efficiency using a combination of text skipping - moving past parts of the text that is impossible to hold the pattern - and bit-parallel manipulation. Bit-parallelism takes advantage of the computer processor's ability to quickly recall items from registers and cache and manipulate it in parallel using simple shift operations and logic gate calculations. Exact pattern matching is also quickly achieved using on-line indexing methods such as Aho-Corasick Automata or the suffix tree/array approach which speed up multiple pattern look up.

Approximate pattern matching traditionally relies on Dynamic Programming approaches which have an O(nm) time complexity. Improvements often rely on bit-parallelism, avoiding unnecessary processing by filtering out parts of the text and various clever tricks relying on low k.

In this blog post I will show how to simulate approximate pattern matching using the naive exact pattern matching algorithm. One presumes it is possible to use better exact pattern-matching algorithms to achieve faster results, but the goal of this post is not to make the ultimate algorithm. Instead, it is to explain how to use bit-parallelism to achieve pattern matching for degenerate and don't-care characters.

I am using the DNA/RNA alphabet for this example. DNA consists of the characters A, C, G, and T. RNA substitutes T for U, making its alphabet A, C, G, U. The characters represent nucleotide bases, for example, A stands for Adenine. We introduce a few extra degenerate characters and a don't care character (N) as defined in the IUPAC Alphabet standard:

A.................Adenine
C.................Cytosine
G.................Guanine
T.................Thymine
U.................Uracil
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base

I call a character like R or B degenerate, meaning it can represent more than one single alphabet character. I call N a don't care character because it can represent any base. Note that there are 4 core letters (A[T|U]CG), assuming that T is equivalent to U and 15 characters all together considering. So why do we have this special IUPAC standard? Well, one reason is because there are enzymes that recognise a segment of DNA with some flexibility. The restriction enzyme EcoRII's specificity subunit recognises and cuts DNA when it finds the site CCWGG. W could be A or T. It is useful for biologists to know where it makes its cuts in the genome. Another reason the IUPAC alphabet might be useful is to define DNA sequences where the exact identity of a base is unclear or may contain errors when it is first sequenced.

We can represent every IUPAC character using a binary representation different to the ASCII standard. Through this binary representation and clever use of bit masks it is possible to make R = A or R = G, or N = any base.

First I create an array containing 122 elements whose index match to the ASCII numerical value of all the letters above. For example 'A' is index 65, 'B' is index 66 and so on. I could have made this a smaller array taking only the range A - Y, but I do not like using subtract all the time in my code. Thus, most of the positions will hold the value 0, but for A, I define its value to be 1, 0001 in binary. In fact, I define the 4 primary nucleotide bases as such:

NUC['A'] = 1; //00000001
NUC['C'] = 2; //00000010
NUC['G'] = 4; //00000100
NUC['T'] = 8; //00001000
NUC['U'] = 8//00001000

Notice that the position of the binary ones make each character distinct. For each distinct character a new bit is needed. In our case with an effective alphabet size of 4, it can all fit into the first nibble of a byte.

Now we design the degenerate characters by making them bit-masks of the characters they can represent.

NUC['R'] = 5;  //00000101 - a/g
NUC['Y'] = 10; //00001010 - c/t
NUC['S'] = 6;  //00000110 - g/c
NUC['W'] = 9;  //00001001 - a/t
NUC['K'] = 12; //00001100 - g/t
NUC['M'] = 3;  //00000011 - a/c
NUC['B'] = 14; //00001110 - c/g/t
NUC['D'] = 13; //00001101 - a/g/t
NUC['H'] = 11; //00001011 - a/c/t
NUC['V'] = 7;  //00000111 - a/c/g
NUC['N'] = 15; //00001111 - a/c/g/t

Before we can use this, I need to describe how to find an exact pattern in a text using the regular method. Let's say we have a text T = "GACCAGGAG" and a pattern X = "CCWGG" and we need to find X in T. But because the naive method cannot recognise the degenerate W character in T, it would not be sensible to use X as is. In fact, you would be forced to do two searches, once using CCAGG and the next time CCTGG. This is the naive search method:

i = 0
while i <= n - m:
    j = 0
    while j < m and T[i+j] == X[j]:
        j += 1
        if j == m:
            print "Match found at position: ", i
    i += 1

We use the variable i to keep track of the position in T and j the position in X. The outer while loop moves the pointer i through T one position at a time. So for every position in T, the inner while loop tries to move through X. It only continues to loop through X as long as the current character in T and subsequent characters match that of X until j is equal to the length of the pattern, in which case we record a match. If before j == m there is a mismatch, it breaks out of the inner loop and moves to the next position in T to keep looking for the pattern.

OK. If that is sufficiently clear, we can modify the above exact pattern matching strategy to match degenerate characters and don't cares based on the following principles.

Each character A[T|U]CG in the NUC array we defined above has a unique binary code with the position of its binary 1 differing. We use logical AND (&) to check if one character matches another.

  NUC['A']: 00000001
& NUC['A']: 00000001

--------------------
            00000001

  NUC['A']: 00000001
& NUC['C']: 00000010
--------------------
            00000000

So when A matches A, the resulting character is A and when it doesn't, it gives a different character (0). With degenerate characters we do:

  NUC['A']: 00000001
& NUC['W']: 00001001
--------------------
            00000001

  NUC['T']: 00001000
& NUC['W']: 00001001
--------------------
            00001000

  NUC['C']: 00000010
& NUC['W']: 00001001
--------------------
            00000000

As you can see, W matches either A or T to give the character it was tested with, and when compared with C, a character it does not represent, it does not match it. The don't care character N will match A, C, G or T/U because it's bit pattern is 00001111. So if we modify our naive algorithm above, we can change the condition of the inner loop so that if a character from X, whether it is A[T|U]CG or a degenerate character, when ANDed with a character from T results in the same character from T, then it is a match.

For ease, I create an array P to hold the binary representation of the characters in X as defined in the NUC array.

P = []
for j = 0 to m - 1:
    P.append(NUC[X[j]])

Then I can easily use it in the modified naive method, which this time supports degenerate and don't care characters:

i = 0
while i <= n - m:
    j = 0
    while j < m and (NUC[T[i+j]] & P[j]) == NUC[T[i+j]]:
        j += 1
        if j == m:
            print "Match found at position: ", i
    i += 1

So there it is. I searched with X = "CCWGG" and it found CCAGG in T. If I had CCTGG in T it would have found it as well. Also, if I didn't care about the value of one or two of the letters, then I could have used N in the pattern and it would have found the pattern I was looking for.

The naive method is limited to searching at O(nm) time complexity in the worst case, but if this technique is incorporated into a better exact string matching algorithm, it will be able to do it much quicker and with the added benefit of finding patterns using a degenerate character set which supports don't cares.

Friday, 29 April 2016

ISO-639-2 list of languages and their directions

I collated a list of the languages (ISO-639-2) and the direction they are most commonly written in and provided it as a TXT or CSV file for download here: https://github.com/webmasterar/languagesWithDirections

Examples:

ara,,ar,Arabic,arabe,rtl
chi,zho,zh,Chinese,chinois,ttb

eng,,en,English,anglais,ltr

This is mostly useful with the correctly displaying text on websites using the HTML bdo tag or CSS styles direction and/or writing-mode.

Sunday, 17 April 2016

Latitude and Longitude to Timezone PHP script

I converted a Java code I found online that was really useful to PHP so anyone can use it now. It basically returns the timezone of a location given its latitude and longitude. Check out the example:


<?php
include 'TimezoneMapper.php';

$london = ['lat' => 51.504380, 'lng' => -0.127727];
echo TimezoneMapper::latLngToTimezoneString($london['lat'], $london['lng']);
//outputs: Europe/London

You can get the code here: https://github.com/webmasterar/TimezoneMapperPHP

Wednesday, 13 April 2016

A quick guide to compiling C/C++ code and writing Makefiles (Part 1: Compiling)


This is the first of a 3 series post. This post will be about compiling. The second post will be about Makefiles. The third post will be about compiling libraries.

C/C++ coding is difficult enough as it is to code and debug without having to spend more time figuring out how write a good Makefile to compile the damn code and fix the Makefile bugs! In this post I will talk about compiling using the command line tools. But usually you don't write these commands manually on the console to build a project and be able to run the application. You write the commands in the Makefile and then you just build the project using the Makefile.

A Makefile is used to store the commands needed to build or clean a project. The user may download your project and just has to write the make command on the console to build the project to create the executable binary file that will run on their machine. In Windows you get given binaries like virus.exe with the .exe executable extension that just works on pretty much all Windows computers as long as there are no architectural requirements (64bit versions of software will not run on 32bit systems). But this is not the case for Unix/Linux. With all the different distros available and their internal complexities, it is not likely that a binary executable built on FreeBSD will also run on Ubuntu. Also there is no quick way to state a file is executable because .exe is not used to denote an executable file in the Linux world. So, for Linux mainly, the source code is often packaged so you can dowload and extract it, and you are expected to compile the code using the make command to create the executable for your machine in order to run it.

Which tool you compile your code with varies by operating system and there is a choice of compilers for each system anyway. On Windows I use MingW and on Linux I use GCC. There are often different compilers for C or C++ code too. It should be noted that C++ compilers are capable of compiling C code but sometimes complain about things that are not acceptable anymore in C++. I recommend compiling C code with a C compiler and C++ code with a C++ compiler. You can mix C and C++ code, but if you do, be prepared to edit your C code.

Anyway, compiling. The C compiler tool in Linux is gcc and the C++ compiler is g++. If you use MinGW to build a project on Windows, it will know what you are trying to compile your code with, so it's cool to carry on using gcc and g++. There are two sequential steps to compiling.
  1. (dependent on the project) Make sure you have linked in any required library files (.a/.so) and their header files (.h).
  2. Compile the source code (.c/.cpp) files into object(.o) files.
  3. Combine the object files into one executable file.
Let's start with step two first because I will write another article about compiling and linking libraries at a later time. The following command will compile the code into object files for a hypothetical project consisting of two source files and the header file associated with the important program: exampleProg.c, reusableCmd.c and reusableCmd.h.

gcc -I . -c exampleProg.c reusableCmd.c

This will create the files exampleProg.o and reusableCmd.o. The flag -I . means find the header files required to compile this program are in the current directory (.). If we needed to include other header files in another directory, we would add an additional I flag like so: -I ./../dir2/libX.

Then we have to compile those files into an executable file. On Linux we don't give it an extension usually, but on Windows we specify it ends with .exe. To compile our files on Linux we do:

gcc -I . -o exampleProg exampleProg.o reusableCmd.o

And there you have it. The code will produce the executable file exampleProg, which is ready to run. The C++ compiler works in much the same way too. It is best to add some additional flags to optimize the executable file or improve debugging. Here are some flags I use a lot in my projects:

-Wall = Show all code warnings
-lm = Load the Math library if you #include <math .h%gt;- you need to add this flag explicitly if you include this header!
-funroll-loops = optimize loops
-O3 = Use maximum processor and memory optimizations
-fomit-frame-pointer = Omit some of the debugging pointers included in the executable - use this only once your code works perfectly
-msse4.2 = Use SSE4.2 machine instructions where available to improve performance - this is especially useful on modern Intel processors.
-std=c99 = This sets the expected standard of the compilation. You can also use an even newer standard such as c11 if your compiler supports it.


In the next post I will talk about Makefiles.

Tuesday, 12 April 2016

Counting Sort (index-based sort) reinvented

So the past weekend I independently came up with what I would later find out was the counting sort and I went as far as implementing it before I found out it already existed. Awkward. I initially called it the quick dictionary sort and afterwards renamed it to the quick tally sort (qtsort) because a "dictionary" is a list of words and not individual characters and numbers which is what this algorithm is used for.

The counting sort is a linear time O(n) sorting algorithm because it does not work by an element value comparison technique, which has a proven best worst case complexity of O(n log n). Many sorting algorithms are O(n2) in the worst case, but satisfyingly the counting sort is linear. That's a revelation when it comes to sorting because it is so fast! The best thing about the counting sort is that it is so elegantly simple and can take up little extra memory too. The space complexity is also very small if the range of the elements is small O(nmax - nmin).

Let's go through an example with a character array to clarify. Imagine a string t made of a set of characters: t = {A,D,B,E,A}. We know that the ascii value of character A is 65, B is 66 and the rest of the alphabet is ordered sequentionally in the same way. We can create an array in the range of characters in t to hold all the unique values. The character A is the smallest value (65) in t and the character E is the largest (69).

We create an array T, which stands for tally, and it should hold the frequency each character occurs in text t. As arrays start at index 0, we can make any character in t, A through D, be represented by an index of our array by subtracting the minimum value from each characters. Also, our array T, will be of size = tmax - tmin + 1, or 5 = 69 - 65 + 1. To let the letter A be represented as the first index (0) we do: index = A - tmin. T looks like this when we initialize it and we haven't gone through the text yet:

 idx|val
+---+---+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
+---+---+

Notice that there is position 2 in the array even when we don't have the character C in t. We go through each character in t and each time we increment the value at the calculated index of that character, so eventually our T array looks like this. We have two A's, and one each of B, D and E.

 idx|val
+---+---+
| 0 | 2 |
| 1 | 1 |

| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
+---+---+

The T array is filled out using a simple loop which goes through each character of t and increments the counter, so by the end of the loop it has made a tally of how often each character occurs. This action is represented by this bit of psuedo code:

foreach char in t:
    T[char - tmin] = T[char - tmin] + 1

Now that we have filled the tally array, we can go through it in sequential order and wherever the value is 1 or more, we print that many of the character. This can be implemented with a simple for loop through the range of the items in T. With each index in the array, we work out which character it really represents by adding tmin back to it, so with the first element, c = idx + tmin, A= 65 = 0 + 65; and with the next index, B = 66 = 1 + 65. If the value at an index in T is 0, meaning no characters exist for it in t, then we skip it and go to the next index. Here is the psuedo code:

foreach idx,val in T:
    if val > 0:
        c = idx + tmin
        while val > 0:
            print c
            val = val - 1


This simple bit of code would then print out t in the correct order: AABDE. The algorithm runs linear in time O(n), but is limited by the range of the values, so if you had an huge range of numbers it would occupy a lot of memory for T and would need to go through all of T in order to print it out. Therefore the Counting sort is best applied to long strings or arrays of numbers within a small range of values. It is unfortunately unsuitable for use with other types of data such as floats or words. It's role lies in sorting lists of numbers, such as table indexes, and characters very quickly and using little memory. It is better to use it than other sorts for certain datasets and will yield a result faster than O(n log n) time.

You can check out a working example on github written in C and works only with characters currently, but can be modified with ease to work with larger integers. You don't need to specify tmin or tmax because the program works that out in linear time too! Here it is: https://github.com/webmasterar/qtsort