Friday 1 December 2017

Getting DNA out of a FASTA file by position or chromosome id

A common task in bioinformatics is to read a FASTA file to get a sequence from it. You might want to grab a whole chromosome from a genome (multi-)FASTA file, or you might want to grab a bit of DNA from a single chromosome by providing the start position and end position (range). In both cases, I have you covered:

For extracting a chromosome from a genome file, like the 900MB GRCh37.gz file provided by the 1000 genomes project, which contains all the human chromosomes in one file - I created a program to extract individual chromosomes to a FASTA file. You don't even need to extract the original .gz file - it handles it as is. Check it out here: https://github.com/webmasterar/extractChromosome

For extracting a string of bases from a chromosome stored in a FASTA file just by providing the start and end positions, I created a little Python helper function called getRef() and you can access it here: https://gist.github.com/webmasterar/3a60155d4ddc8595b17fa2c62893dbb0

It is easy to use and takes three arguments: getRef(Chr_File, Start_Pos, End_Pos).

Wednesday 22 November 2017

MySQL: Using COALESCE and temporary variables to fill out empty column cells

In this post I will be describing two potential solutions to a specific problem in managing human-created data in a MySQL database related to the presence of NULLs in table columns.

Data entry is a tedious and demanding task on humans and the data produced often tends to contain errors and inconsistencies. This presents a problem to the database administrator who needs to manage the quality of the data in the database as well as maintain the relations between tables. One such inconsistency is the presence of empty ("") or NULL fields purposely introduced into a dataset because it contains repeated data. These empty values were introduced by the data entry clerk because they did not feel there was a need to enter the same data in multiple times. This behavior is actually understandable and has a beneficial effect because it lessens the clerk's workload, reducing the time spent on the job as well as reducing the potential for typos because the less that is typed the fewer the number of errors introduced. So it should not be seen as an error but rather it is an inconsistency in the data of the table making it unsuitable for querying and impossible to normalize even up to 1NF. It is up to the database administrator to solve this problem.

Consider the following example:

For the past two years Lucas, the Data Entry Clerk, has been working for the Royal Swedish Academy of Sciences and has been responsible for entering the information about Nobel prize laureates into the database. Unfortunately, while Lucas has been working there, he has been paid little attention as well as a poor salary. Today, Elsa, the new MySQL Database Administrator, came in to look at the data he had collected and realized she had a problem. This was the data she she saw in the table for Nobel laureates collected in 2016 and 2017:

Query: SELECT * FROM nobel_laureates;
+-------------+------+------------------------+----------------------------+-------------------------+
| laureate_id | year | field                  | fname                      | lname                   |
+-------------+------+------------------------+----------------------------+-------------------------+
|           1 | 2016 | Physics                | David                      | Thouless                |
|           2 | NULL | NULL                   | Duncan                     | Haldane                 |
|           3 | NULL | NULL                   | John                       | Kosterlitz              |
|           4 | NULL | Chemistry              | Jean-Pierre                | Sauvage                 |
|           5 | NULL | NULL                   | Fraser                     | Stoddart                |
|           6 | NULL | NULL                   | Ben                        | Feringa                 |
|           7 | NULL | Physiology or Medicine | Yoshinori                  | Ohsumi                  |
|           8 | NULL | Literature             | Bob                        | Dylan                   |
|           9 | NULL | Peace                  | Juan Manuel                | Santos                  |
|          10 | NULL | Economics              | Oliver                     | Hart                    |
|          11 | NULL | NULL                   | Bengt                      | Holmstrom               |
|          12 | 2017 | Physics                | Rainer                     | Weiss                   |
|          13 | NULL | NULL                   | Barry                      | Barish                  |
|          14 | NULL | NULL                   | Kip                        | Thorne                  |
|          15 | NULL | Chemistry              | Jacques                    | Dubochet                |
|          16 | NULL | NULL                   | Joachim                    | Frank                   |
|          17 | NULL | NULL                   | Richard                    | Henderson               |
|          18 | NULL | Physiology or Medicine | Jeffrey                    | Hall                    |
|          19 | NULL | NULL                   | Michael                    | Rosbash                 |
|          20 | NULL | NULL                   | Michael                    | Young                   |
|          21 | NULL | Literature             | Kazuo                      | Ishiguro                |
|          22 | NULL | Peace                  | International Campaign to  | Abolish Nuclear Weapons |
|          23 | NULL | Economics              | Richard                    | Thaler                  |
+-------------+------+------------------------+----------------------------+-------------------------+

So many NULLs in the year and field columns!

It became clear to her what he had done and why. Essentially, he entered the first value for the year 2016 and felt he didn't need to enter it every time for all the laureates of the same year. And for the field column, he had done the same thing, where for three people awarded for Physics in 2016, he needed only to enter the field once for the first person. She realized this needed to be fixed and considered it an interesting challenge. So how does Elsa fill down the correct data in the columns to eradicate the NULLs?

I found two solutions to the problem and there may be more. But first let's recreate the dataset so you can follow along and understand how I go about solving the problem.

-- Create the example table

CREATE TABLE nobel_laureates (
laureate_id INT(10) NOT NULL AUTO_INCREMENT,
year INT(4) DEFAULT NULL,
field VARCHAR(50),
fname VARCHAR(50) NOT NULL,
lname VARCHAR(50) NOT NULL,
UNIQUE(laureate_id)
);

-- Populate with some data with NULLs in it signifying repeated data

INSERT INTO nobel_laureates (year, field, fname, lname) VALUES
(2016, 'Physics', 'David', 'Thouless'),
(NULL, NULL, 'Duncan', 'Haldane'),
(NULL, NULL, 'John', 'Kosterlitz'),
(NULL, 'Chemistry', 'Jean-Pierre', 'Sauvage'),
(NULL, NULL, 'Fraser', 'Stoddart'),
(NULL, NULL, 'Ben', 'Feringa'),
(NULL, 'Physiology or Medicine', 'Yoshinori', 'Ohsumi'),
(NULL, 'Literature', 'Bob', 'Dylan'),
(NULL, 'Peace', 'Juan Manuel', 'Santos'),
(NULL, 'Economics', 'Oliver', 'Hart'),
(NULL, NULL, 'Bengt', 'Holmstrom'),
(2017, 'Physics', 'Rainer', 'Weiss'),
(NULL, NULL, 'Barry', 'Barish'),
(NULL, NULL, 'Kip', 'Thorne'),
(NULL, 'Chemistry', 'Jacques', 'Dubochet'),
(NULL, NULL, 'Joachim', 'Frank'),
(NULL, NULL, 'Richard', 'Henderson'),
(NULL, 'Physiology or Medicine', 'Jeffrey', 'Hall'),
(NULL, NULL, 'Michael', 'Rosbash'),
(NULL, NULL, 'Michael', 'Young'),
(NULL, 'Literature', 'Kazuo', 'Ishiguro'),
(NULL, 'Peace', 'International Campaign to ', 'Abolish Nuclear Weapons'),
(NULL, 'Economics', 'Richard', 'Thaler');

-- Keeping a backup just in case something goes wrong

CREATE TABLE nl_bkp AS SELECT * FROM nobel_laureates;

Solution 1: Create a lookup table containing filled-in information that can be used to update the rows of this table. This solution works by running a SELECT sub-query on every row to fill in the value of the column. That means we need to run the UPDATE query for as many columns (d) are required. This has a performance downside because it needs to run d times for n rows, a complexity of O(d*n), which is slow, as well as requiring a temporary table and a lot of SQL to solve the problem.

-- Creating a temporary table without repeating NULL rows
CREATE TABLE nl_filled AS
SELECT * FROM nobel_laureates WHERE year IS NOT NULL OR field IS NOT NULL ORDER BY laureate_id;

-- Fill in year using nl_filled
UPDATE nobel_laureates n
SET n.year = (SELECT f.year FROM nl_filled f WHERE f.laureate_id < n.laureate_id AND f.year IS NOT NULL ORDER BY f.laureate_id DESC LIMIT 1)
WHERE n.year IS NULL ORDER BY n.laureate_id;
-- Fill in field using nl_filled
UPDATE nobel_laureates n
SET n.field = (SELECT f.field FROM nl_filled f WHERE f.laureate_id < n.laureate_id AND f.field IS NOT NULL ORDER BY f.laureate_id DESC LIMIT 1)
WHERE n.field IS NULL ORDER BY n.laureate_id;

-- Cleanup
DROP TABLE nl_filled;
DROP TABLE nobel_laureates;
CREATE TABLE nobel_laureates AS SELECT * FROM nl_bkp;

Using the year query to explain what is going on: the UPDATE query runs through each row one by one in ORDER of laureate_id to update only the rows where year IS NULL. The inner SELECT query returns one row (LIMIT 1), the highest previous row as determined ORDER BY laureate_id DESC, from the filled-in table (nl_filled) and year is selected to update the nobel_laureates year field.

Solution 2: Use temporary variables and the COALESCE() function to store values from the previous row and assign it to the current row in order to update the value. This solution is ingenious because it requires little SQL code, updates all columns simultaneously and is fast, taking only O(n) time to solve the problem.

SET @yr = NULL;
SET @fld = NULL;

UPDATE nobel_laureates SET
year = (@yr := COALESCE(year, @yr)),
field = (@fld := COALESCE(field, @fld))
ORDER BY laureate_id;

I will be using the year column again to explain what is happening: first, we create a temporary variable @yr which we assign NULL. Then in the UPDATE query, we use COALESCE(year, @yr). The COALESCE function takes 2 or more arguments and returns the first non-NULL value it finds in the list, so for the first row that is seen in the table:

+-------------+------+------------------------+----------------------------+-------------------------+
| laureate_id | year | field                  | fname                      | lname                   |
+-------------+------+------------------------+----------------------------+-------------------------+
|           1 | 2016 | Physics                | David                      | Thouless                |
+-------------+------+------------------------+----------------------------+-------------------------+

The value in @yr is NULL and the value in column year is 2016 - so COALESCE returns 2016. This is assigned back to @yr using the special in-query assignment operator := and the new value stored in @yr is now 2016. So when it comes to the next row where year is NULL, the COALESCE function is called again and it compares NULL and @yr, and returns @yr's 2016 value to update the year column.

It is necessary to use ORDER BY laureate_id  in the query because the UPDATE command needs to run on each row in order, otherwise the result would not make sense or not be complete.

In conclusion, you can solve this problem in at least two ways in MySQL but using temporary variables and COALESCE is the best solution. Elsa had to solve this problem through lots of googling and reading stackoverflow responses but you are lucky because I posted the solution here and you are welcome to use it.

Tuesday 24 January 2017

Accessing SMB network drives on Linux (Ubuntu)

Last year my university's file servers failed and luckily they kept a backup of my home drive. When they finally restored the backups, they sent me some cryptic instructions along the lines of:

Your Informatics Home was not affected by the data loss and is available under: smb:\\india.inf.kcl.ac.uk\k1234567\

So I googled around about how to access this. I recognized the 'smb' protocol to be a SAMBA protocol and tried to find a way to connect to the drive, but I found nothing useful. Even searching for SMB connection with Linux or how to install samba on Linux gave me nothing.

It's only until I realized I needed to search "how to mount network drive in Linux" that I found the correct solution as described on this page: http://ubuntuhandbook.org/index.php/2014/08/map-network-drive-onto-ubuntu-14-04/.

You can follow the instructions on that page verbatim. In the case of samba shares, the only difference is I removed the 'smb:' part and replaced the back slashes (\\, \) with forward slashes. So the final line in the /etc/fstab file was:

//india.inf.kcl.ac.uk/k1234567 /media/k1234567 cifs credentials=/home/k1234567/.smbcredentials,iocharset=utf8,gid=100,uid=1000,file_mode=0777,dir_mode=0777 0 0

I hope this helps out anyone who is wondering what to do with smb urls.