Friday, 18 December 2015

Understanding GROUP BY and when to use it (SQL)

This is a little guide I wrote for second year computer science students at KCL doing the database module. But anyone is welcome to benefit from this.

---

Imagine we have a list of students and each of these students attend a single college. We create the students table and colleges tables containing data like:

Table:Students
+---+----------+-----+
| 1 | Dave     | KCL |
| 2 | Gemma    | KCL |
| 3 | Ali      | UCL |
| 4 | Adam     | CWC |
| 5 | Claudia  | UCL |
+---+----------------+

Table:Colleges
+-----+------------------------------+
| UCL | University College London    |
| KCL | King's College London        |
| CWC | City of Westminister College |
| WLC | West London College          |
+-----+------------------------------+


Next, imagine the dean of KCL asks you: How many students attend each college?
So now you have to flex your SQL skills and you want to give him a list of college codes and the number of students attending them. Something like this:

+-----+---+
| WLC | 0 |
| CWC | 1 |
| KCL | 2 |
| UCL | 2 |
+-----+---+


But how do we achieve this? Let's give it a try and go through some common mistakes. First, let's do it the naive (newb) way:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code;
Result:
+-----+---+
| CWC | 5 |
+-----+---+


Not what we wanted, right? Why does this happen? That's right, COUNT() is an aggregation function and it returns one row containing the count of all the students, and along with it the first row of the college code.

You realise you need to use GROUP BY:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code;

+-----+---+
| CWC | 1 |
| KCL | 2 |
| UCL | 2 |
+-----+---+


Observe that because we use an INNER JOIN, WLC is not in the list. How do we correct this? That's right, you use a LEFT JOIN so all the colleges are listed.

So that's got us the correct result. Now, you think to yourself, what if he had asked me for a list of colleges with 2 or more students? So you consider using a condition in the WHERE clause:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
WHERE
num_students >= 2
GROUP BY
Colleges.code;


Result:
ERROR: Unknown column 'num_students' in 'where clause'.

OK. So we try again with COUNT(*) >= 2.

Result:
ERROR: Invalid use of group function.

So what is the correct solution?! You realise that you need to apply a condition that is related to the aggregated resultset that is returned by the query.
When you use aggregate functions like COUNT and GROUP BY, you would need to use "HAVING" like so:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code
HAVING
num_students >= 2;

+-----+---+
| KCL | 2 |
| UCL | 2 |
+-----+---+


Just what we wanted here!

Anyway, you get back to our original task - get a list of colleges and the number of students that attend them. You know how to answer it but you want to list the colleges ordered by num_students.
You use LEFT JOIN and ORDER BY to achieve this using the query:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
LEFT JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code
ORDER BY
num_students DESC, code ASC;


And this gives you the output that will impress the dean, impress all who know you and establish you as an SQL guru in the department.

+-----+---+
| KCL | 2 |
| UCL | 2 |
| CWC | 1 |
| WLC | 0 |
+-----+---+



Sampling k-combinations with linear memory and time - the bitmask/shift trick

k-combinations are like permutations but what they achieve is a string of length n made up of an alphabet of sigma (s) characters. Imagine we are dealing with RNA (alphabet = {a, u, t, g}) and we want the different combinations of triplet codons (n = 3), we know that there are sn or 43 = 64 combinations: (aaa, aau, aat, aag, uaa, uau, ..., etc). While we could generate the combinations and save them all to an array, we might come across a problem if we want a longer string. Let's say we increase n to 30, the combinations of 430 is 1152921504606846976 combinations. That's significantly more than 64, and we would struggle to generate and store that many combinations in memory. Basically, the size of our array grows exponentially with the increasing n. The longer the string you need, the more time and memory is required to produce and store all the combinations.

What I'm going to briefly explain is a way to get any combination without an expensive algorithm to generate all the combinations in order to store it in main memory or on disk. There is no need for a large array because our combinations are generated on the fly. The time complexity of getting any particular permutation is linear so it is useful for sampling or getting a small set of combinations without having to generate all of them.

Let's say we want combinations of length n:

int n = 3;

First we stick the alphabet in an array:

int s = 4;
char alphabet[s] = {'a', 'u', 'c', 'g'};

Then we work out how many bits are needed to represent the alphabet characters by rounding-up the square-root of the alphabet size:

int bits = ceil(√s); //sqrt(4) = 2

The number of combinations is of course sn:

int numCombinations = pow(s, n); //64

So imagine we want to print out all the combinations, all we have to do is to loop numCombinations times and decode the index of the loop (i) into a combination using a bitmask:

unsigned long long int i, j, c;
unsigned long long int mask = (1 << bits) - 1; //00000011
for (i = 0; i < numCombinations; i++) {
    c = i;
    for (j = 0; j < n; j++) {
        printf("%c", alphabet[ (mask & c) ]);
        c = c >> bits;
    }
    printf("\n");
}

Basically, if you wanted a single random combination, just pick a random number between 0 and 63 inclusive. Create the bitmask and for each character in n shuffle the bit along a set number of bits.

Imagine we chose the number 57 at random. This is 111001 in binary. We can read two bits at a time by doing a bitwise AND using the mask (000011) and then we do a bitwise right-shift two positions:

000011 & 111001 = 000001 //index 1 (u)
shift c right 2 bits.
000011 & 001110 = 000010 //index 2 (c)
shift c right 2 bits.
000011 & 000011 = 000011 //index 3 (g)

When we print out the characters from our alphabet at index 1, 2 and 3 and together we get: ucg.

This technique is really straightforward while (bits * n) is less than or equal to the number of bits in a computer word (w). That is the immediate technical limitation. If you use a unsigned long long int to store your combinations then you can use up to 64 bits on a 64 bit computer. If you need to use more than that then it will get a little more complicated as you juggle bits across multiple computer words.

PIP installer not working (not installing Django and other packages)

I was trying to install Django using the PIP installer today and I came across this error:

~$ sudo pip install -U django
Searching for install
Reading https://pypi.python.org/simple/install/
Couldn't find index page for 'install' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for install
error: Could not find suitable distribution for Requirement.parse('install')

I found out the way to fix this problem was to reinstall pip using the easy installer and then update pip.

~$ sudo easy_install pip
~$ sudo pip install --upgrade pip

And this allowed me to install Django successfully.