Tuesday 24 December 2019

My Thesis

I forgot to update the blog with my doctoral thesis. It is titled:



I hope you like it and benefit from it.

About this blog... I'm working on many things at my new job, but I am not at liberty to discuss them. That is one of the reasons this blog has not been updated and will not be updated for a long time.

Friday 1 December 2017

Getting DNA out of a FASTA file by position or chromosome id

A common task in bioinformatics is to read a FASTA file to get a sequence from it. You might want to grab a whole chromosome from a genome (multi-)FASTA file, or you might want to grab a bit of DNA from a single chromosome by providing the start position and end position (range). In both cases, I have you covered:

For extracting a chromosome from a genome file - like the 900MB GRCh37.gz file provided by the 1000 genomes project, which contains all the human chromosomes in one file - I created a program to extract individual chromosomes to their own FASTA files. You don't even need to decompress the original .gz file first - it handles it as is. Check it out here: https://github.com/webmasterar/extractChromosome

For extracting a string of bases from a chromosome stored in a FASTA file just by providing the start and end positions, I created a little Python helper function called getRef() and you can access it here: https://gist.github.com/webmasterar/3a60155d4ddc8595b17fa2c62893dbb0

It is easy to use and takes three arguments: getRef(Chr_File, Start_Pos, End_Pos).
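The gist has the actual implementation, but as a rough sketch of what such a helper does - assuming a single-sequence FASTA file and 1-based, inclusive coordinates, which may differ from the gist - it might look like this in Python:

def get_ref(chr_file, start_pos, end_pos):
    # Sketch only: assumes one sequence per file and 1-based, inclusive positions
    parts = []
    with open(chr_file) as f:
        for line in f:
            if line.startswith('>'):
                continue  # skip the FASTA header line
            parts.append(line.strip())
    return ''.join(parts)[start_pos - 1:end_pos]

print(get_ref('chr7.fa', 25119491, 25119500))  # ten bases from chromosome 7

The real helper may well be smarter about memory; this version reads the whole chromosome into a string, which is fine for a quick job.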

Wednesday 22 November 2017

MySQL: Using COALESCE and temporary variables to fill out empty column cells

In this post I will be describing two potential solutions to a specific problem in managing human-created data in a MySQL database related to the presence of NULLs in table columns.

Data entry is a tedious and demanding task, and the data produced often contains errors and inconsistencies. This presents a problem for the database administrator, who needs to manage the quality of the data as well as maintain the relations between tables. One such inconsistency is the presence of empty ("") or NULL fields purposely introduced into a dataset where the data repeats. The data entry clerk introduces these empty values because they see no need to enter the same data multiple times. This behavior is understandable and even beneficial: it lessens the clerk's workload, and the less that is typed, the fewer the errors introduced. So it should not be seen as an error, but it is an inconsistency that makes the table unsuitable for querying and impossible to normalize even to 1NF. It is up to the database administrator to solve this problem.

Consider the following example:

For the past two years Lucas, the Data Entry Clerk, has been working for the Royal Swedish Academy of Sciences, responsible for entering information about Nobel prize laureates into the database. Unfortunately, while working there he has received little attention and a poor salary. Today, Elsa, the new MySQL Database Administrator, came in to look at the data he had collected and realized she had a problem. This was the data she saw in the table for Nobel laureates collected in 2016 and 2017:

Query: SELECT * FROM nobel_laureates;
+-------------+------+------------------------+----------------------------+-------------------------+
| laureate_id | year | field                  | fname                      | lname                   |
+-------------+------+------------------------+----------------------------+-------------------------+
|           1 | 2016 | Physics                | David                      | Thouless                |
|           2 | NULL | NULL                   | Duncan                     | Haldane                 |
|           3 | NULL | NULL                   | John                       | Kosterlitz              |
|           4 | NULL | Chemistry              | Jean-Pierre                | Sauvage                 |
|           5 | NULL | NULL                   | Fraser                     | Stoddart                |
|           6 | NULL | NULL                   | Ben                        | Feringa                 |
|           7 | NULL | Physiology or Medicine | Yoshinori                  | Ohsumi                  |
|           8 | NULL | Literature             | Bob                        | Dylan                   |
|           9 | NULL | Peace                  | Juan Manuel                | Santos                  |
|          10 | NULL | Economics              | Oliver                     | Hart                    |
|          11 | NULL | NULL                   | Bengt                      | Holmstrom               |
|          12 | 2017 | Physics                | Rainer                     | Weiss                   |
|          13 | NULL | NULL                   | Barry                      | Barish                  |
|          14 | NULL | NULL                   | Kip                        | Thorne                  |
|          15 | NULL | Chemistry              | Jacques                    | Dubochet                |
|          16 | NULL | NULL                   | Joachim                    | Frank                   |
|          17 | NULL | NULL                   | Richard                    | Henderson               |
|          18 | NULL | Physiology or Medicine | Jeffrey                    | Hall                    |
|          19 | NULL | NULL                   | Michael                    | Rosbash                 |
|          20 | NULL | NULL                   | Michael                    | Young                   |
|          21 | NULL | Literature             | Kazuo                      | Ishiguro                |
|          22 | NULL | Peace                  | International Campaign to  | Abolish Nuclear Weapons |
|          23 | NULL | Economics              | Richard                    | Thaler                  |
+-------------+------+------------------------+----------------------------+-------------------------+

So many NULLs in the year and field columns!

It became clear to her what he had done and why. Essentially, he entered the year 2016 once, for the first laureate, and felt he didn't need to repeat it for every laureate of the same year. He had done the same with the field column: for the three people awarded the 2016 Physics prize, he only entered the field for the first person. She realized this needed to be fixed and considered it an interesting challenge. So how does Elsa fill down the correct data in the columns to eradicate the NULLs?

I found two solutions to the problem and there may be more. But first let's recreate the dataset so you can follow along and understand how I go about solving the problem.

-- Create the example table

CREATE TABLE nobel_laureates (
    laureate_id INT(10) NOT NULL AUTO_INCREMENT,
    year INT(4) DEFAULT NULL,
    field VARCHAR(50),
    fname VARCHAR(50) NOT NULL,
    lname VARCHAR(50) NOT NULL,
    UNIQUE(laureate_id)
);

-- Populate with some data with NULLs in it signifying repeated data

INSERT INTO nobel_laureates (year, field, fname, lname) VALUES
(2016, 'Physics', 'David', 'Thouless'),
(NULL, NULL, 'Duncan', 'Haldane'),
(NULL, NULL, 'John', 'Kosterlitz'),
(NULL, 'Chemistry', 'Jean-Pierre', 'Sauvage'),
(NULL, NULL, 'Fraser', 'Stoddart'),
(NULL, NULL, 'Ben', 'Feringa'),
(NULL, 'Physiology or Medicine', 'Yoshinori', 'Ohsumi'),
(NULL, 'Literature', 'Bob', 'Dylan'),
(NULL, 'Peace', 'Juan Manuel', 'Santos'),
(NULL, 'Economics', 'Oliver', 'Hart'),
(NULL, NULL, 'Bengt', 'Holmstrom'),
(2017, 'Physics', 'Rainer', 'Weiss'),
(NULL, NULL, 'Barry', 'Barish'),
(NULL, NULL, 'Kip', 'Thorne'),
(NULL, 'Chemistry', 'Jacques', 'Dubochet'),
(NULL, NULL, 'Joachim', 'Frank'),
(NULL, NULL, 'Richard', 'Henderson'),
(NULL, 'Physiology or Medicine', 'Jeffrey', 'Hall'),
(NULL, NULL, 'Michael', 'Rosbash'),
(NULL, NULL, 'Michael', 'Young'),
(NULL, 'Literature', 'Kazuo', 'Ishiguro'),
(NULL, 'Peace', 'International Campaign to ', 'Abolish Nuclear Weapons'),
(NULL, 'Economics', 'Richard', 'Thaler');

-- Keeping a backup just in case something goes wrong

CREATE TABLE nl_bkp AS SELECT * FROM nobel_laureates;

Solution 1: Create a lookup table containing the filled-in information and use it to update the rows of the original table. This solution works by running a SELECT sub-query on every row to fill in the value of a column, which means the UPDATE query must be run once for each of the d columns that need filling. This has a performance downside because it runs d times over n rows, a complexity of O(d*n), which is slow, and it also requires a temporary table and a lot of SQL to solve the problem.

-- Creating a temporary table without repeating NULL rows
CREATE TABLE nl_filled AS
SELECT * FROM nobel_laureates WHERE year IS NOT NULL OR field IS NOT NULL ORDER BY laureate_id;

-- Fill in year using nl_filled
UPDATE nobel_laureates n
SET n.year = (SELECT f.year FROM nl_filled f WHERE f.laureate_id < n.laureate_id AND f.year IS NOT NULL ORDER BY f.laureate_id DESC LIMIT 1)
WHERE n.year IS NULL ORDER BY n.laureate_id;
-- Fill in field using nl_filled
UPDATE nobel_laureates n
SET n.field = (SELECT f.field FROM nl_filled f WHERE f.laureate_id < n.laureate_id AND f.field IS NOT NULL ORDER BY f.laureate_id DESC LIMIT 1)
WHERE n.field IS NULL ORDER BY n.laureate_id;

-- Cleanup: drop the helper table and restore the original data ready for Solution 2
DROP TABLE nl_filled;
DROP TABLE nobel_laureates;
CREATE TABLE nobel_laureates AS SELECT * FROM nl_bkp;

Using the year query to explain what is going on: the UPDATE goes through the rows one by one in order of laureate_id, updating only the rows where year IS NULL. For each such row, the inner SELECT returns one row (LIMIT 1): the nearest preceding row with a non-NULL year, as determined by ORDER BY laureate_id DESC, from the filled-in table (nl_filled). Its year value is used to update the nobel_laureates year column.

Solution 2: Use temporary variables and the COALESCE() function to store values from the previous row and assign them to the current row. This solution is ingenious because it requires little SQL code, updates all the columns simultaneously, and is fast, taking only O(n) time to solve the problem.

SET @yr = NULL;
SET @fld = NULL;

UPDATE nobel_laureates SET
year = (@yr := COALESCE(year, @yr)),
field = (@fld := COALESCE(field, @fld))
ORDER BY laureate_id;

I will use the year column again to explain what is happening: first, we create a temporary variable @yr and assign it NULL. Then, in the UPDATE query, we use COALESCE(year, @yr). The COALESCE function takes two or more arguments and returns the first non-NULL value in the list, so for the first row seen in the table:

+-------------+------+------------------------+----------------------------+-------------------------+
| laureate_id | year | field                  | fname                      | lname                   |
+-------------+------+------------------------+----------------------------+-------------------------+
|           1 | 2016 | Physics                | David                      | Thouless                |
+-------------+------+------------------------+----------------------------+-------------------------+

The value in @yr is NULL and the value in the year column is 2016, so COALESCE returns 2016. This is assigned back to @yr using the special in-query assignment operator :=, so the value stored in @yr is now 2016. When it comes to the next row, where year is NULL, COALESCE is called again, compares NULL and @yr, and returns @yr's value of 2016 to update the year column.

The ORDER BY laureate_id in the query is necessary because the UPDATE must visit the rows in order; otherwise the values carried forward would be wrong or incomplete.
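Before concluding, it may help to see the same forward-fill idea outside SQL. This minimal Python sketch (the rows list is hypothetical, just to mirror the table) does exactly what the UPDATE does: carry the last non-NULL value down through the ordered rows in a single pass:

# Rows must already be in laureate_id order
rows = [
    {'year': 2016, 'field': 'Physics'},
    {'year': None, 'field': None},
    {'year': None, 'field': 'Chemistry'},
]

yr, fld = None, None
for row in rows:
    if row['year'] is None:    # COALESCE(year, @yr)
        row['year'] = yr
    yr = row['year']           # @yr := ...
    if row['field'] is None:   # COALESCE(field, @fld)
        row['field'] = fld
    fld = row['field']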

In conclusion, you can solve this problem in at least two ways in MySQL, but using temporary variables and COALESCE is the better solution. Elsa had to solve this problem through lots of googling and reading Stack Overflow responses, but you are lucky: the solution is posted here and you are welcome to use it.

Tuesday 24 January 2017

Accessing SMB network drives on Linux (Ubuntu)

Last year my university's file servers failed and luckily they kept a backup of my home drive. When they finally restored the backups, they sent me some cryptic instructions along the lines of:

Your Informatics Home was not affected by the data loss and is available under: smb:\\india.inf.kcl.ac.uk\k1234567\

So I googled around for how to access this. I recognized 'smb' as the Samba (SMB) protocol and tried to find a way to connect to the drive, but I found nothing useful. Even searching for SMB connection with Linux or how to install Samba on Linux gave me nothing.

It was only when I realized I needed to search for "how to mount network drive in Linux" that I found the correct solution, as described on this page: http://ubuntuhandbook.org/index.php/2014/08/map-network-drive-onto-ubuntu-14-04/

You can follow the instructions on that page verbatim. For Samba shares, the only difference is that I removed the 'smb:' part and replaced the backslashes (\\, \) with forward slashes. So the final line in the /etc/fstab file was:

//india.inf.kcl.ac.uk/k1234567 /media/k1234567 cifs credentials=/home/k1234567/.smbcredentials,iocharset=utf8,gid=100,uid=1000,file_mode=0777,dir_mode=0777 0 0
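For reference, the .smbcredentials file named in that line simply keeps your login details out of /etc/fstab. Mine looked something like this - the values are placeholders, and the domain line may not be needed on your network. Protect it with chmod 600 ~/.smbcredentials:

username=k1234567
password=your_password_here
domain=YOURDOMAIN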

I hope this helps out anyone who is wondering what to do with smb urls.

Sunday 27 November 2016

Working with huge files in Unix/Linux

The Unix or GNU/Linux command line has a variety of tools that are essential when working with gigantic files. Bioinformaticians often work with files that are many gigabytes in size. Recently, I was working with very large VCF and FASTA files and wanted to check their contents. Both files came gzipped so I needed to decompress them first.

~$ gzip -d file.vcf.gz
~$ gzip -d seq.fasta.gz

Now, what I had at the end of this was an 11GB VCF file and a 250MB FASTA file - way too big to open in any text editor. But I can use a variety of utilities available on the Linux command line to work with them.

To view the first few lines I could use head like so:

~$ head file.vcf

By default, this did not display enough lines for my needs. I wanted to view the first 100 lines, so I used the -n parameter like so:

~$ head -n 100 file.vcf

I can also use head to grab the first few characters if I am reading from a FASTA file with a really long sequence on a single line. If I wanted to grab the first 1000 characters I can write:

~$ head -c 1000 seq.fasta

I can also use the same parameter to grab the last few characters or lines from the file using tail like so:

~$ tail -c 1000 seq.fasta

This is all well and good, but what if I want to inspect only certain lines from a file - say lines 10 to 20? I can use awk for that:

~$ awk 'NR==10,NR==20' file.vcf

I can also use awk to skip the first 10 lines like so:

~$ awk 'NR>10' file.vcf 

OK. What if I want to count how many characters are in the second line of a FASTA file where, in some cases, a whole sequence is put onto one line? I can use a combination of awk and wc to count the letters:

~$ awk 'NR==2' seq.fasta | wc -c

If you want to count how many lines and how many characters there are in a file, you can use wc, passing it both the l and c flags:

~$ wc -lc seq.fasta

Now, if I want to navigate through the whole VCF file to see what it contains, I cannot use a text editor. Instead, I use 'less'. 'less' is an improved version of 'more', if you are familiar with that one. It lets you jump to a given line number, and you can use PgUp and PgDn and the up and down arrows to move through the file. Press Q to quit back to your console when you are done.

~$ less file.vcf

Less is OK for some things, but it still takes quite some time to read through a really, really big file. It may be easier to make a smaller sample file from the big one so you can comfortably inspect it. Use the redirection operator '>' to output, say, 1000 lines to a new file called smaller.vcf:

~$ head -n 1000 file.vcf > smaller.vcf

OK. There's something I didn't mention before which may save you a lot of time if you are in a rush or really short of disk space - you don't need to spend all that time decompressing the VCF or FASTA files! You can decompress them on the fly on the command line and pipe a little of the decompressed output to one of the tools mentioned above. Say you just want to grab 1000 lines from file.vcf.gz, which is 11GB when decompressed, but you don't want to wait an hour for it to decompress and you only have 4GB of free disk space to spare. Try this:

~$ gzip -dc file.vcf.gz | head -n 1000 > just1000lines.vcf

The trick is that gzip can decompress to the console using the -c flag. While it decompresses, the output is piped to head, which collects lines until 1000 have been read; then, instead of printing those lines to the console, we redirect them into a file: just1000lines.vcf. This all happens quickly and efficiently in memory, and only 1000 lines are ever written to disk, saving you time and disk space.
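The same trick works inside a script. As an illustration, this small Python sketch (using the same file names as above) streams the compressed file and stops after the first 1000 lines, so the full 11GB is never written to disk:

import gzip

with gzip.open('file.vcf.gz', 'rt') as src, open('just1000lines.vcf', 'w') as dst:
    for i, line in enumerate(src):
        if i >= 1000:
            break        # stop decompressing once we have enough lines
        dst.write(line)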

Hopefully these command line tools - gzip, head, tail, awk, wc, less - and the redirection operator '>' will help you when you are working with large files.

Using 1000 genomes data files

The 1000 genomes project was established in 2008, a few years after the human genome project (HGP) was completed in 2003. While the HGP was aimed at producing a single reference genome for Homo sapiens, the goal of the 1000 genomes project was to identify the variation present within the species by sequencing thousands of individuals using faster and cheaper next generation sequencing (NGS) technology. Both projects put their data online for researchers to download for free.

FASTA Files

The HGP packages the human genome into FASTA format files, with one file for every human chromosome (1..22, X, Y) and an accessory FASTA file for the mitochondrial DNA (mtDNA). The FASTA format is a really simple text file format whose file names tend to end in .fa, .fas or .fasta. The first line of the file is a descriptive header beginning with the greater-than symbol '>' followed by the name and/or description and/or unique identifier of the sequence. The lines after that are the sequence. Typically, 60 characters are present on a line, followed by a new line character '\n', and this pattern continues until the end of the sequence. Here is an example of part of the Human Cytochrome C gene:

>chromosome:GRCh38:7:25119491:25125961:-1
CTGTCTCTGCCGCAAATGCAGCACCTTCCTCAGTCTTGGGGCGTCCATCAAGCGGAGAGC
TGGAATGTGACTGAAAGTAACAGAGTATAGGTGGAACTAGAACAAGGTCACGAGCTGGTG
CCCAACTCGTGTAACTCCTGGTTCCAGGCTGTCCCTCTAGGATGCTGTTGGCGCCAGCCG
TGGTAGTAGGACAGAGTCTGGGTAAAGGTCGCCTGCTGACAAGATGGTAGCTAGGAGACC
TCTAGGTGCGCCAAAATTAGGGCGTCTTTTCC...(etc)

As you can see, the first line describes where the sequence comes from. GRCh38 is the latest version of the HGP reference sequence. The gene is found on chromosome 7, starting at position 25119491 and ending at position 25125961, and the -1 indicates it is encoded on the reverse strand. The lines after the header are the actual genetic code.

The HGP packages each assembled chromosome's worth of genetic data into an individual FASTA file of between 50 and 250MB in size. You can download the gzip-compressed files from the ENSEMBL FTP site: ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/

An alternative to downloading each chromosome from the ENSEMBL site is to download an 860MB compressed file containing all the chromosomes in one big FASTA file from the 1000 genomes project. In this blog post I suggest downloading the slightly older GRCh37 version of the HGP, specifically the file hs37d5.fa.gz, because it is fully compatible with the VCF files mentioned below. You can obtain the reference genome from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/

You don't have to decompress that gigantic FASTA file because I wrote a little utility application to extract specific chromosomes from the large gzip file and output each chromosome in its own FASTA file: https://github.com/webmasterar/extractChromosome

After you download a chromosome file from ENSEMBL, you can decompress it using your favourite decompression utility, such as WinRAR, or on Linux with the command: gzip -d Homo_sapiens.GRCh38.dna_rm.chromosome.21.fa.gz

FASTA is a simple and flexible format. There are no restrictions on the number of sequences in a file as long as they each start with a header line followed by the sequence. Such files are typically known as multi-FASTA files, but they don't have a specific file suffix to indicate that. The format allows for comment lines starting with a semi-colon ';'. There are no restrictions on line length either and a whole sequence could be present on a single line. Typically, when reading FASTA files in a program, they are read line by line, stripping line-ending characters and spaces, skipping past empty and comment lines starting with ';', and storing the title and sequence in memory as strings ready to be searched or analysed.
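To make that reading strategy concrete, here is a minimal Python sketch of a (multi-)FASTA parser along those lines; it is an illustration rather than a hardened parser:

def read_fasta(path):
    # Yield (header, sequence) pairs from a FASTA or multi-FASTA file
    header, parts = None, []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(';'):
                continue  # skip empty lines and ';' comment lines
            if line.startswith('>'):
                if header is not None:
                    yield header, ''.join(parts)
                header, parts = line[1:], []
            else:
                parts.append(line)
    if header is not None:
        yield header, ''.join(parts)

for name, seq in read_fasta('chr21.fa'):  # file name is just an example
    print(name, len(seq))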

VCF Files

VCF files are also text files, but the format is more complex than FASTA. VCF files are used to store variants, typically single nucleotide polymorphisms (SNPs), and consist of a set of mandatory and optional tab-separated columns. They can represent more complex variants, such as insertions and deletions, too. VCF files are used together with FASTA reference chromosome files and would be of little use by themselves: each row of the VCF file is a record of a SNP/variant and points to a position in the reference chromosome where the variation occurs. Together, this makes it possible to map all the variants from multiple donors onto the reference human genome.

VCF files are often distributed as specially compressed files that might resemble gzip files (they end with .gz) but are actually compressed using BGZIP. The BGZIP compression utility is part of a package known as Tabix, and both are now part of the htslib library packaged with SAMtools. If you ever need to use BGZIP, you need only install the Tabix package using your Linux distro's package manager, e.g. sudo apt-get install tabix. Tabix contains a library for reading tab-separated file formats, and together with BGZIP it allows applications to read the tab-separated VCF format in its compressed state. This is especially important because VCF files are huge: chromosome 1's compressed VCF file is 1.2GB in size, and decompressed it takes up around 70GB of disk space! Imagine having to decompress all the human chromosomes - a time-consuming and disk-consuming exercise that is ultimately unnecessary thanks to Tabix and its BGZIP compression.

A VCF file contains one record per line, with the columns separated by tabs. The first few lines of the file are metadata and comments starting with a hash '#' character; they contain details about the file and its corresponding reference sequence, information about indels and other complex variants, and details about optional column headers. The mandatory columns are:

#CHROM - the chromosome number
POS - the position that the SNP/variant occurs in the reference sequence
ID - just the ID of the SNP/variant
REF - the base that should be found in the reference sequence at this position
ALT - the alternate base or bases that might be present here*
QUAL - the quality of the read
FILTER - only accept this record if it says 'PASS'
INFO - various details about phasing and such

*The ALT field may contain references to complex variants written in angle brackets like so '<ADD:INS:1>'.

Here is an example record from a VCF file:

#CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO                                         FORMAT  HG00096
10      60523  rs148087467  T    G    100   PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2  GT:GL   0|0:-0.19,-0.46,-2.28

You can see it is a T to G single nucleotide polymorphism occurring at position 60523 on chromosome 10.

Most VCF files list the variants in ascending order of position because many software applications that read VCF files expect them to be in order. It is possible to have multiple entries for the same position of a chromosome in VCF files and thus have duplicates; I will deal with that issue in a future blog post. Typically, you would not write your own parser for serious work on a VCF file because of its complex compression and file format. For a C++ project I used VCFLib.
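That said, if you just want to peek at a few records rather than do serious work, BGZIP output is still valid gzip, so Python's gzip module can stream it. Here is a rough sketch (the file name is hypothetical) that skips the '#' metadata lines and splits a record on tabs:

import gzip

COLUMNS = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']

with gzip.open('chr10.vcf.gz', 'rt') as f:
    for line in f:
        if line.startswith('#'):
            continue  # skip metadata lines and the column header
        fields = line.rstrip('\n').split('\t')
        record = dict(zip(COLUMNS, fields))  # optional columns follow the eight
        print(record['CHROM'], record['POS'], record['REF'], '->', record['ALT'])
        break  # the first record is enough for a peek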

The 1000 genomes project was performed in three major phases. The first, or pilot, phase obtained DNA from 179 individuals; the second (Phase 1) from 1092 individuals (including the pilots); and the third (Phase 3) from 2504 individuals (including the people from the first two phases). They were sequenced with improving NGS technologies and the data was collected into VCF files, one per chromosome. Since the project started in 2008, the reference human genome at the time was GRCh37, and as such the VCF files use it as their reference; this is mentioned in the comments of the Phase 3 VCF files. When you download a compressed (.gz) VCF file, you will also need to download its accompanying Tabix index file (.tbi), which contains the index that lets Tabix read the file in its compressed state. You DO NOT need to decompress the VCF files. The files for Phase 3 can be downloaded from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

In conclusion, to use the 1000 genomes files you will need to download the reference human genome sequence chromosome by chromosome (or as one big 860MB compressed file), plus the VCF file for each chromosome and its accompanying TBI file. There is no need to decompress the files.

Saturday 6 August 2016

Pattern matching with degenerate characters and don't-cares using bit-vector encoding

The pattern matching problem falls into two practical categories: exact and approximate pattern matching.

In pattern matching we have a text T of length n and a pattern X of length m ≤ n. In exact pattern matching we want to find X in T exactly as it is. In approximate pattern matching we want to find X in T but it doesn't need to match exactly. We define a threshold value k which is the maximum number of optimal changes that can be made to X to make it match a part of T. If more than k changes are required to make X match a part of T then we conclude that it is not in fact a factor of T and move on to look at the next position.

Exact pattern matching is easily solved using the naive method in worst-case O(nm) time. Of course, better algorithms exist to reduce this complexity. The KMP algorithm is famous, but it is not the best one to use. Other algorithms include Shift-And/Shift-Or, Horspool, BNDM and BOM, of which the last two are the most effective. They achieve their efficiency using a combination of text skipping - moving past parts of the text that cannot possibly contain the pattern - and bit-parallel manipulation. Bit-parallelism takes advantage of the computer processor's ability to quickly recall items from registers and cache and manipulate them in parallel using simple shift operations and bitwise logic. Exact pattern matching can also be sped up with indexing methods such as Aho-Corasick automata or the suffix tree/array approach, which make multiple-pattern lookup fast.

Approximate pattern matching traditionally relies on Dynamic Programming approaches with O(nm) time complexity. Improvements often rely on bit-parallelism, on filtering out parts of the text to avoid unnecessary processing, and on various clever tricks that exploit a low k.

In this blog post I will show how to simulate approximate pattern matching using the naive exact pattern matching algorithm. One presumes it is possible to use better exact pattern-matching algorithms to achieve faster results, but the goal of this post is not to make the ultimate algorithm. Instead, it is to explain how to use bit-parallelism to achieve pattern matching for degenerate and don't-care characters.

I am using the DNA/RNA alphabet for this example. DNA consists of the characters A, C, G, and T. RNA substitutes U for T, making its alphabet A, C, G, U. The characters represent nucleotide bases; for example, A stands for Adenine. We introduce a few extra degenerate characters and a don't-care character (N) as defined in the IUPAC Alphabet standard:

A.................Adenine
C.................Cytosine
G.................Guanine
T.................Thymine
U.................Uracil
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base

I call a character like R or B degenerate, meaning it can represent more than one alphabet character. I call N a don't-care character because it can represent any base. Note that there are 4 core letters (A[T|U]CG), assuming that T is equivalent to U, and 15 characters altogether. So why do we have this special IUPAC standard? Well, one reason is that there are enzymes that recognise a segment of DNA with some flexibility. The restriction enzyme EcoRII's specificity subunit recognises and cuts DNA when it finds the site CCWGG, where W can be A or T. It is useful for biologists to know where it makes its cuts in the genome. Another reason the IUPAC alphabet is useful is to describe DNA sequences where the exact identity of a base is unclear or may contain errors from when it was first sequenced.

We can represent every IUPAC character using a binary representation different to the ASCII standard. Through this binary representation and clever use of bit masks it is possible to make R = A or R = G, or N = any base.

First I create an array of 122 elements whose indices match the ASCII values of the letters above. For example, 'A' is index 65, 'B' is index 66 and so on. I could have made this a smaller array covering only the range A - Y, but I do not like using subtraction all the time in my code. Most of the positions will hold the value 0, but for A I define the value to be 1, which is 0001 in binary. In fact, I define the 4 primary nucleotide bases as follows:

NUC['A'] = 1; //00000001
NUC['C'] = 2; //00000010
NUC['G'] = 4; //00000100
NUC['T'] = 8; //00001000
NUC['U'] = 8; //00001000

Notice that the positions of the binary ones make each character distinct. Each new distinct character needs its own bit. In our case, with an effective alphabet size of 4, everything fits into the first nibble of a byte.

Now we design the degenerate characters by making them bit-masks of the characters they can represent.

NUC['R'] = 5;  //00000101 - a/g
NUC['Y'] = 10; //00001010 - c/t
NUC['S'] = 6;  //00000110 - g/c
NUC['W'] = 9;  //00001001 - a/t
NUC['K'] = 12; //00001100 - g/t
NUC['M'] = 3;  //00000011 - a/c
NUC['B'] = 14; //00001110 - c/g/t
NUC['D'] = 13; //00001101 - a/g/t
NUC['H'] = 11; //00001011 - a/c/t
NUC['V'] = 7;  //00000111 - a/c/g
NUC['N'] = 15; //00001111 - a/c/g/t
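Since the rest of this post uses Python for its sketches, here is the same table as a Python dict, which saves us the 122-element array (this NUC is what the code below assumes):

NUC = {
    'A': 0b0001, 'C': 0b0010, 'G': 0b0100, 'T': 0b1000, 'U': 0b1000,
    'R': 0b0101, 'Y': 0b1010, 'S': 0b0110, 'W': 0b1001, 'K': 0b1100,
    'M': 0b0011, 'B': 0b1110, 'D': 0b1101, 'H': 0b1011, 'V': 0b0111,
    'N': 0b1111,  # don't care: matches any base
}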

Before we can use this, I need to describe how to find an exact pattern in a text using the regular method. Let's say we have a text T = "GACCAGGAG" and a pattern X = "CCWGG" and we need to find X in T. Because the naive method cannot interpret the degenerate W character in X, it would not be sensible to use X as is. In fact, you would be forced to do two searches, once using CCAGG and the next time CCTGG. This is the naive search method:

T = "GACCAGGAG"
X = "CCAGG"  # one of the two literal searches (the other would use CCTGG)
n, m = len(T), len(X)

i = 0
while i <= n - m:
    j = 0
    while j < m and T[i+j] == X[j]:
        j += 1
        if j == m:
            print("Match found at position:", i)
    i += 1

We use the variable i to keep track of the position in T and j the position in X. The outer while loop moves the pointer i through T one position at a time, and for every position in T the inner while loop tries to move through X. It keeps looping through X as long as the characters of T and X continue to match; when j equals the length of the pattern, we record a match. If there is a mismatch before j == m, it breaks out of the inner loop and moves on to the next position in T to keep looking for the pattern.

OK. If that is sufficiently clear, we can modify the above exact pattern matching strategy to match degenerate characters and don't cares based on the following principles.

Each character A[T|U]CG in the NUC array we defined above has a unique binary code with the position of its binary 1 differing. We use logical AND (&) to check if one character matches another.

  NUC['A']: 00000001
& NUC['A']: 00000001
--------------------
            00000001

  NUC['A']: 00000001
& NUC['C']: 00000010
--------------------
            00000000

So when A matches A, the resulting character is A and when it doesn't, it gives a different character (0). With degenerate characters we do:

  NUC['A']: 00000001
& NUC['W']: 00001001
--------------------
            00000001

  NUC['T']: 00001000
& NUC['W']: 00001001
--------------------
            00001000

  NUC['C']: 00000010
& NUC['W']: 00001001
--------------------
            00000000

As you can see, W matches either A or T, giving back the character it was tested with, and when compared with C, a character it does not represent, the result is 0 - no match. The don't-care character N will match A, C, G or T/U because its bit pattern is 00001111. So we modify the naive algorithm above by changing the condition of the inner loop: a character from X, whether it is one of A[T|U]CG or a degenerate character, matches a character from T if ANDing the two gives back that same character from T.

For ease, I create a list P to hold the binary representations of the characters in X as defined in NUC. This time X can be the degenerate pattern itself:

X = "CCWGG"
P = []
for j in range(len(X)):
    P.append(NUC[X[j]])

Then I can easily use it in the modified naive method, which this time supports degenerate and don't care characters:

i = 0
while i <= n - m:
    j = 0
    while j < m and (NUC[T[i+j]] & P[j]) == NUC[T[i+j]]:
        j += 1
        if j == m:
            print("Match found at position:", i)
    i += 1

So there it is. I searched with X = "CCWGG" and it found CCAGG in T. If T had contained CCTGG, it would have found that as well. And if I didn't care about the value of one or two of the letters, I could have used N in the pattern and it would still have found what I was looking for.
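Putting it all together as one runnable Python sketch, using the NUC dict defined above:

def degenerate_search(T, X):
    # Naive search where X may contain IUPAC degenerate and don't-care letters
    n, m = len(T), len(X)
    P = [NUC[c] for c in X]
    for i in range(n - m + 1):
        if all((NUC[T[i + j]] & P[j]) == NUC[T[i + j]] for j in range(m)):
            print("Match found at position:", i)

degenerate_search("GACCAGGAG", "CCWGG")  # prints: Match found at position: 2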

The naive method is limited to O(nm) worst-case time, but if this technique is incorporated into a better exact string matching algorithm, it will run much faster, with the added benefit of finding patterns over a degenerate character set that supports don't-cares.

Friday 29 April 2016

ISO-639-2 list of languages and their directions

I collated a list of the ISO-639-2 languages and the direction each is most commonly written in, and provided it as a TXT or CSV file for download here: https://github.com/webmasterar/languagesWithDirections

Examples:

ara,,ar,Arabic,arabe,rtl
chi,zho,zh,Chinese,chinois,ttb
eng,,en,English,anglais,ltr

This is mostly useful for correctly displaying text on websites using the HTML bdo tag or the CSS direction and/or writing-mode properties.

Sunday 17 April 2016

Latitude and Longitude to Timezone PHP script

I converted some Java code I found online, which was really useful, to PHP so anyone can use it now. It returns the timezone of a location given its latitude and longitude. Check out the example:


<?php
include 'TimezoneMapper.php';

$london = ['lat' => 51.504380, 'lng' => -0.127727];
echo TimezoneMapper::latLngToTimezoneString($london['lat'], $london['lng']);
//outputs: Europe/London

You can get the code here: https://github.com/webmasterar/TimezoneMapperPHP