Ahmad's Blog: 2017

-- Create the example table
CREATE TABLE nobel_laureates (
    laureate_id INT(10) NOT NULL AUTO_INCREMENT,
    year INT(4) DEFAULT NULL,
    field VARCHAR(50),
    fname VARCHAR(50) NOT NULL,
    lname VARCHAR(50) NOT NULL,
    UNIQUE(laureate_id)
);

-- Populate with some data with NULLs in it signifying repeated data
INSERT INTO nobel_laureates (year, field, fname, lname) VALUES
    (2016, 'Physics', 'David', 'Thouless'),
    (NULL, NULL, 'Duncan', 'Haldane'),
    (NULL, NULL, 'John', 'Kosterlitz'),
    (NULL, 'Chemistry', 'Jean-Pierre', 'Sauvage'),
    (NULL, NULL, 'Fraser', 'Stoddart'),
    (NULL, NULL, 'Ben', 'Feringa'),
    (NULL, 'Physiology or Medicine', 'Yoshinori', 'Ohsumi'),
    (NULL, 'Literature', 'Bob', 'Dylan'),
    (NULL, 'Peace', 'Juan Manuel', 'Santos'),
    (NULL, 'Economics', 'Oliver', 'Hart'),
    (NULL, NULL, 'Bengt', 'Holmstrom'),
    (2017, 'Physics', 'Rainer', 'Weiss'),
    (NULL, NULL, 'Barry', 'Barish'),
    (NULL, NULL, 'Kip', 'Thorne'),
    (NULL, 'Chemistry', 'Jacques', 'Dubochet'),
    (NULL, NULL, 'Joachim', 'Frank'),
    (NULL, NULL, 'Richard', 'Henderson'),
    (NULL, 'Physiology or Medicine', 'Jeffrey', 'Hall'),
    (NULL, NULL, 'Michael', 'Rosbash'),
    (NULL, NULL, 'Michael', 'Young'),
    (NULL, 'Literature', 'Kazuo', 'Ishiguro'),
    (NULL, 'Peace', 'International Campaign to ', 'Abolish Nuclear Weapons'),
    (NULL, 'Economics', 'Richard', 'Thaler');

-- Keeping a backup just in case something goes wrong
CREATE TABLE nl_bkp AS SELECT * FROM nobel_laureates;

Solution 1: Create a lookup table containing the filled-in rows and use it to update the rows of this table. For every row that is missing a value, a SELECT sub-query looks up the nearest earlier row that has one. One UPDATE query is needed for each column that has to be filled, so for d columns and n rows the sub-query runs on the order of d*n times, a complexity of at least O(d*n). That is slow, and it also requires a temporary table and a lot of SQL to solve the problem.

-- Creating a temporary table without repeating NULL rows
CREATE TABLE nl_filled AS
    SELECT * FROM nobel_laureates
    WHERE year IS NOT NULL OR field IS NOT NULL
    ORDER BY laureate_id;

-- Fill in year using nl_filled
UPDATE nobel_laureates n
SET n.year = (
    SELECT f.year FROM nl_filled f
    WHERE f.laureate_id < n.laureate_id AND f.year IS NOT NULL
    ORDER BY f.laureate_id DESC LIMIT 1
)
WHERE n.year IS NULL
ORDER BY n.laureate_id;

-- Fill in field using nl_filled
UPDATE nobel_laureates n
SET n.field = (
    SELECT f.field FROM nl_filled f
    WHERE f.laureate_id < n.laureate_id AND f.field IS NOT NULL
    ORDER BY f.laureate_id DESC LIMIT 1
)
WHERE n.field IS NULL
ORDER BY n.laureate_id;

-- Cleanup
DROP TABLE nl_filled;
DROP TABLE nobel_laureates;
CREATE TABLE nobel_laureates AS SELECT * FROM nl_bkp;

Using the year query to explain what is going on: the UPDATE query visits the rows in order of laureate_id and changes only those where year IS NULL. For each such row, the inner SELECT query looks in the filled-in table (nl_filled) for earlier rows that have a year, sorts them with ORDER BY laureate_id DESC so that the closest preceding row comes first, and keeps only that one (LIMIT 1). Its year is then written into the year column of nobel_laureates.
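To see the lookup in action without a MySQL server, here is a minimal sketch that runs the same idea through Python's built-in sqlite3 module. It is an illustration under assumptions, not the queries above verbatim: SQLite stands in for MySQL, the table is trimmed to five rows, the column types are SQLite's, and the ORDER BY on the UPDATE is dropped because the sub-query does not depend on it.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nobel_laureates (
    laureate_id INTEGER PRIMARY KEY AUTOINCREMENT,
    year INTEGER,
    field TEXT,
    fname TEXT NOT NULL,
    lname TEXT NOT NULL
);
INSERT INTO nobel_laureates (year, field, fname, lname) VALUES
    (2016, 'Physics', 'David', 'Thouless'),
    (NULL, NULL, 'Duncan', 'Haldane'),
    (NULL, 'Chemistry', 'Jean-Pierre', 'Sauvage'),
    (2017, 'Physics', 'Rainer', 'Weiss'),
    (NULL, NULL, 'Barry', 'Barish');

-- Lookup table: only the rows that carry a year or a field.
CREATE TABLE nl_filled AS
    SELECT * FROM nobel_laureates
    WHERE year IS NOT NULL OR field IS NOT NULL;
""")

# One UPDATE per column, each with a correlated sub-query that picks
# the closest earlier row in nl_filled that has a value for that column.
for column in ("year", "field"):
    conn.execute(f"""
        UPDATE nobel_laureates
        SET {column} = (
            SELECT f.{column} FROM nl_filled f
            WHERE f.laureate_id < nobel_laureates.laureate_id
              AND f.{column} IS NOT NULL
            ORDER BY f.laureate_id DESC
            LIMIT 1
        )
        WHERE {column} IS NULL
    """)

rows = conn.execute(
    "SELECT laureate_id, year, field, lname "
    "FROM nobel_laureates ORDER BY laureate_id"
).fetchall()
for row in rows:
    print(row)
```

Every row should come back with both year and field filled in; for example, Sauvage takes 2016 from the Thouless row but keeps its own Chemistry field.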

Solution 2: Use temporary (user-defined) variables and the COALESCE() function to remember the value from the previous row and assign it to the current row. This solution is ingenious because it requires little SQL code, updates all columns simultaneously and is fast: it makes a single pass over the table, taking only O(n) time to solve the problem.

SET @yr = NULL;
SET @fld = NULL;

UPDATE nobel_laureates SET
    year = (@yr := COALESCE(year, @yr)),
    field = (@fld := COALESCE(field, @fld))
ORDER BY laureate_id;

I will use the year column again to explain what is happening. First, we create a temporary variable @yr and assign NULL to it. Then, in the UPDATE query, we use COALESCE(year, @yr). The COALESCE function takes two or more arguments and returns the first non-NULL value it finds in the list. So consider the first row in the table:

+-------------+------+------------------------+----------------------------+-------------------------+
| laureate_id | year | field                  | fname                      | lname                   |
+-------------+------+------------------------+----------------------------+-------------------------+
|           1 | 2016 | Physics                | David                      | Thouless                |
+-------------+------+------------------------+----------------------------+-------------------------+

The value in @yr is NULL and the value in the year column is 2016, so COALESCE returns 2016. This is assigned back to @yr using the special in-query assignment operator :=, so @yr now holds 2016. On the next row, where year is NULL, COALESCE is called again with NULL and @yr, and it returns the 2016 stored in @yr, which becomes the new value of the year column.

It is necessary to use ORDER BY laureate_id in the query because each row depends on the value carried over from the row before it. If the UPDATE command did not visit the rows in order, the filled-in values could be wrong or incomplete.
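The carry-forward logic can also be traced outside the database. The following is a minimal Python sketch of the idea, not MySQL code: the names coalesce and fill_forward are made up for this illustration, the local variables yr and fld play the roles of @yr and @fld, and the sorted() call stands in for ORDER BY laureate_id.

```python
def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE()."""
    return next((v for v in values if v is not None), None)


def fill_forward(rows):
    """Fill missing year/field values from the previous row, in id order."""
    yr = None   # plays the role of @yr
    fld = None  # plays the role of @fld
    filled = []
    # Sorting mirrors ORDER BY laureate_id: each row must see the values
    # left behind by the row directly before it.
    for row in sorted(rows, key=lambda r: r["laureate_id"]):
        yr = coalesce(row["year"], yr)      # @yr := COALESCE(year, @yr)
        fld = coalesce(row["field"], fld)   # @fld := COALESCE(field, @fld)
        filled.append({**row, "year": yr, "field": fld})
    return filled


# A trimmed-down copy of the example data, with None standing in for NULL.
rows = [
    {"laureate_id": 1, "year": 2016, "field": "Physics", "lname": "Thouless"},
    {"laureate_id": 2, "year": None, "field": None, "lname": "Haldane"},
    {"laureate_id": 3, "year": None, "field": "Chemistry", "lname": "Sauvage"},
    {"laureate_id": 4, "year": 2017, "field": "Physics", "lname": "Weiss"},
    {"laureate_id": 5, "year": None, "field": None, "lname": "Barish"},
]

for row in fill_forward(rows):
    print(row["laureate_id"], row["year"], row["field"], row["lname"])
```

Removing the sorted() call and passing the rows in a different order shows why the ordering matters: a row would then pick up whatever value happened to be seen last, not the value from the row above it.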

In conclusion, you can solve this problem in at least two ways in MySQL, but using temporary variables and COALESCE is the best solution. Elsa had to solve this problem through lots of googling and reading Stack Overflow responses, but you are lucky because I posted the solution here and you are welcome to use it.

Last year my university's file servers failed and luckily they kept a backup of my home drive. When they finally restored the backups, they sent me some cryptic instructions along the lines of:

Your Informatics Home was not affected by the data loss and is available under: smb:\\india.inf.kcl.ac.uk\k1234567\

So I googled around for how to access this. I recognized 'smb' as the SMB protocol, the one Samba implements, and tried to find a way to connect to the drive, but I found nothing useful. Even searching for "SMB connection with Linux" or "how to install samba on Linux" gave me nothing.

It was only when I realized I needed to search for "how to mount network drive in Linux" that I found the correct solution, as described on this page: http://ubuntuhandbook.org/index.php/2014/08/map-network-drive-onto-ubuntu-14-04/.

You can follow the instructions on that page verbatim. In the case of samba shares, the only difference is that I removed the 'smb:' part, replaced the backslashes (\\, \) with forward slashes and dropped the trailing slash. So the final line in the /etc/fstab file was:

//india.inf.kcl.ac.uk/k1234567 /media/k1234567 cifs credentials=/home/k1234567/.smbcredentials,iocharset=utf8,gid=100,uid=1000,file_mode=0777,dir_mode=0777 0 0
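If you have several shares to convert, the rewrite can be scripted. Here is a small Python sketch of the conversion; the function names smb_url_to_cifs_path and fstab_line are my own for this illustration, and the mount options are simply the ones from the line above, so adjust them to your own setup.

```python
def smb_url_to_cifs_path(smb_url: str) -> str:
    r"""Turn smb:\\host\share\ into the //host/share form used in /etc/fstab."""
    path = smb_url.strip()
    # Drop the 'smb:' prefix if present.
    if path.lower().startswith("smb:"):
        path = path[len("smb:"):]
    # Backslashes become forward slashes.
    path = path.replace("\\", "/")
    # Remove leading and trailing slashes, then add the leading '//' back.
    return "//" + path.strip("/")


def fstab_line(smb_url: str, mount_point: str, credentials_file: str) -> str:
    """Build an /etc/fstab entry using the same options as the example above."""
    options = (
        f"credentials={credentials_file},iocharset=utf8,"
        "gid=100,uid=1000,file_mode=0777,dir_mode=0777"
    )
    return f"{smb_url_to_cifs_path(smb_url)} {mount_point} cifs {options} 0 0"


line = fstab_line(
    "smb:\\\\india.inf.kcl.ac.uk\\k1234567\\",
    "/media/k1234567",
    "/home/k1234567/.smbcredentials",
)
print(line)
```

The script only builds the text of the line; creating the mount point and the credentials file still has to be done as described on the page linked above.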

I hope this helps anyone who is wondering what to do with smb URLs.


Friday, 1 December 2017

Getting DNA out of a FASTA file by position or chromosome id

Wednesday, 22 November 2017

MySQL: Using COALESCE and temporary variables to fill out empty column cells

Tuesday, 24 January 2017

Accessing SMB network drives on Linux (Ubuntu)
