Friday, 18 December 2015

Understanding GROUP BY and when to use it (SQL)

This is a little guide I wrote for second year computer science students at KCL doing the database module. But anyone is welcome to benefit from this.

---

Imagine we have a list of students and each of these students attend a single college. We create the students table and colleges tables containing data like:

Table:Students
+---+----------+-----+
| 1 | Dave     | KCL |
| 2 | Gemma    | KCL |
| 3 | Ali      | UCL |
| 4 | Adam     | CWC |
| 5 | Claudia  | UCL |
+---+----------------+

Table:Colleges
+-----+------------------------------+
| UCL | University College London    |
| KCL | King's College London        |
| CWC | City of Westminister College |
| WLC | West London College          |
+-----+------------------------------+


Next, imagine the dean of KCL asks you: How many students attend each college?
So now you have to flex your SQL skills and you want to give him a list of college codes and the number of students attending them. Something like this:

+-----+---+
| WLC | 0 |
| CWC | 1 |
| KCL | 2 |
| UCL | 2 |
+-----+---+


But how do we achieve this? Let's give it a try and go through some common mistakes. First, let's do it the naive (newb) way:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code;
Result:
+-----+---+
| CWC | 5 |
+-----+---+


Not what we wanted, right? Why does this happen? That's right, COUNT() is an aggregation function and it returns one row containing the count of all the students, and along with it the first row of the college code.

You realise you need to use GROUP BY:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code;

+-----+---+
| CWC | 1 |
| KCL | 2 |
| UCL | 2 |
+-----+---+


Observe that because we use an INNER JOIN, WLC is not in the list. How do we correct this? That's right, you use a LEFT JOIN so all the colleges are listed.

So that's got us the correct result. Now, you think to yourself, what if he had asked me for a list of colleges with 2 or more students? So you consider using a condition in the WHERE clause:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
WHERE
num_students >= 2
GROUP BY
Colleges.code;


Result:
ERROR: Unknown column 'num_students' in 'where clause'.

OK. So we try again with COUNT(*) >= 2.

Result:
ERROR: Invalid use of group function.

So what is the correct solution?! You realise that you need to apply a condition that is related to the aggregated resultset that is returned by the query.
When you use aggregate functions like COUNT and GROUP BY, you would need to use "HAVING" like so:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code
HAVING
num_students >= 2;

+-----+---+
| KCL | 2 |
| UCL | 2 |
+-----+---+


Just what we wanted here!

Anyway, you get back to our original task - get a list of colleges and the number of students that attend them. You know how to answer it but you want to list the colleges ordered by num_students.
You use LEFT JOIN and ORDER BY to achieve this using the query:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
LEFT JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code
ORDER BY
num_students DESC, code ASC;


And this gives you the output that will impress the dean, impress all who know you and establish you as an SQL guru in the department.

+-----+---+
| KCL | 2 |
| UCL | 2 |
| CWC | 1 |
| WLC | 0 |
+-----+---+



Sampling k-combinations with linear memory and time - the bitmask/shift trick

k-combinations are like permutations but what they achieve is a string of length n made up of an alphabet of sigma (s) characters. Imagine we are dealing with RNA (alphabet = {a, u, t, g}) and we want the different combinations of triplet codons (n = 3), we know that there are sn or 43 = 64 combinations: (aaa, aau, aat, aag, uaa, uau, ..., etc). While we could generate the combinations and save them all to an array, we might come across a problem if we want a longer string. Let's say we increase n to 30, the combinations of 430 is 1152921504606846976 combinations. That's significantly more than 64, and we would struggle to generate and store that many combinations in memory. Basically, the size of our array grows exponentially with the increasing n. The longer the string you need, the more time and memory is required to produce and store all the combinations.

What I'm going to briefly explain is a way to get any combination without an expensive algorithm to generate all the combinations in order to store it in main memory or on disk. There is no need for a large array because our combinations are generated on the fly. The time complexity of getting any particular permutation is linear so it is useful for sampling or getting a small set of combinations without having to generate all of them.

Let's say we want combinations of length n:

int n = 3;

First we stick the alphabet in an array:

int s = 4;
char alphabet[s] = {'a', 'u', 'c', 'g'};

Then we work out how many bits are needed to represent the alphabet characters by rounding-up the square-root of the alphabet size:

int bits = ceil(√s); //sqrt(4) = 2

The number of combinations is of course sn:

int numCombinations = pow(s, n); //64

So imagine we want to print out all the combinations, all we have to do is to loop numCombinations times and decode the index of the loop (i) into a combination using a bitmask:

unsigned long long int i, j, c;
unsigned long long int mask = (1 << bits) - 1; //00000011
for (i = 0; i < numCombinations; i++) {
    c = i;
    for (j = 0; j < n; j++) {
        printf("%c", alphabet[ (mask & c) ]);
        c = c >> bits;
    }
    printf("\n");
}

Basically, if you wanted a single random combination, just pick a random number between 0 and 63 inclusive. Create the bitmask and for each character in n shuffle the bit along a set number of bits.

Imagine we chose the number 57 at random. This is 111001 in binary. We can read two bits at a time by doing a bitwise AND using the mask (000011) and then we do a bitwise right-shift two positions:

000011 & 111001 = 000001 //index 1 (u)
shift c right 2 bits.
000011 & 001110 = 000010 //index 2 (c)
shift c right 2 bits.
000011 & 000011 = 000011 //index 3 (g)

When we print out the characters from our alphabet at index 1, 2 and 3 and together we get: ucg.

This technique is really straightforward while (bits * n) is less than or equal to the number of bits in a computer word (w). That is the immediate technical limitation. If you use a unsigned long long int to store your combinations then you can use up to 64 bits on a 64 bit computer. If you need to use more than that then it will get a little more complicated as you juggle bits across multiple computer words.

PIP installer not working (not installing Django and other packages)

I was trying to install Django using the PIP installer today and I came across this error:

~$ sudo pip install -U django
Searching for install
Reading https://pypi.python.org/simple/install/
Couldn't find index page for 'install' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for install
error: Could not find suitable distribution for Requirement.parse('install')

I found out the way to fix this problem was to reinstall pip using the easy installer and then update pip.

~$ sudo easy_install pip
~$ sudo pip install --upgrade pip

And this allowed me to install Django successfully.

Monday, 15 June 2015

An online Dynamic Programming matrix viewer

I decided to create a single page application that generates and displays the Dynamic Programming matrix of two given strings. This is for any bioinformatician or computer scientist who needs a quick and easy DP tool they can use online and be able to view the matrix that is output: DPMatrix.

It supports different algorithms - Pattern matching, Global Alignment (e.g. Needleman-Wunsch) and Local Alignment (e.g. Smith-Waterman). And it supports two DP models - Edit distance and Hamming distance. And it allows you to enter different penalty values.

http://webmasterar.github.io/dpmatrix/

Monday, 26 January 2015

Error running Valgrind on Gentoo Linux: the strlen problem

If you have ever tried to install Valgrind on Gentoo Linux you may have come across this error when you try to run the program:

==20032== Memcheck, a memory error detector
==20032== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==20032== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==20032== Command: ./1.x
==20032==

valgrind: Fatal error at startup: a function redirection
valgrind: which is mandatory for this platform-tool combination
valgrind: cannot be set up. Details of the redirection are:
valgrind:
valgrind: A must-be-redirected function
valgrind: whose name matches the pattern: strlen
valgrind: in an object with soname matching: ld-linux-x86-64.so.2
valgrind: was not found whilst processing
valgrind: symbols from the object with soname: ld-linux-x86-64.so.2
valgrind:
valgrind: Possible fixes: (1, short term): install glibc's debuginfo
valgrind: package on this machine. (2, longer term): ask the packagers
valgrind: for your Linux distribution to please in future ship a non-
valgrind: stripped ld.so (or whatever the dynamic linker .so is called)
valgrind: that exports the above-named function using the standard
valgrind: calling conventions for this platform. The package you need
valgrind: to install for fix (1) is called
valgrind:
valgrind: On Debian, Ubuntu: libc6-dbg
valgrind: On SuSE, openSuSE, Fedora, RHEL: glibc-debuginfo
valgrind:
valgrind: Cannot continue -- exiting now. Sorry.


As it turns out, this is a common error but the internet is awash with confusing information so I will try to help by explain what I did to get it working.

You don't need to reinstall Valgrind. It's fine. But what you do need to do is re-install glibc but with special options enabled. Open up the make.conf file in vi or your favourite editor:

vi /etc/make.conf

The things you need to add are shown in bold:

CFLAGS="-march=nocona -O2 -pipe -ggdb -fno-builtin-strlen"

FEATURES="preserve-libs nostrip splitdebug"

When you save these changes you can reinstall glibc:

sudo emerge glibc

Come back in about half an hour and then Valgrind should work. It took me forever to get it working so I hope I save you some time. If you have the chance I recommend using a different distro instead of Gentoo. Use something with a lot of user support like Ubuntu.