Friday 18 December 2015

Understanding GROUP BY and when to use it (SQL)

This is a little guide I wrote for second year computer science students at KCL doing the database module. But anyone is welcome to benefit from this.

---

Imagine we have a list of students and each of these students attend a single college. We create the students table and colleges tables containing data like:

Table:Students
+---+----------+-----+
| 1 | Dave     | KCL |
| 2 | Gemma    | KCL |
| 3 | Ali      | UCL |
| 4 | Adam     | CWC |
| 5 | Claudia  | UCL |
+---+----------------+

Table:Colleges
+-----+------------------------------+
| UCL | University College London    |
| KCL | King's College London        |
| CWC | City of Westminister College |
| WLC | West London College          |
+-----+------------------------------+


Next, imagine the dean of KCL asks you: How many students attend each college?
So now you have to flex your SQL skills and you want to give him a list of college codes and the number of students attending them. Something like this:

+-----+---+
| WLC | 0 |
| CWC | 1 |
| KCL | 2 |
| UCL | 2 |
+-----+---+


But how do we achieve this? Let's give it a try and go through some common mistakes. First, let's do it the naive (newb) way:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code;
Result:
+-----+---+
| CWC | 5 |
+-----+---+


Not what we wanted, right? Why does this happen? That's right, COUNT() is an aggregation function and it returns one row containing the count of all the students, and along with it the first row of the college code.

You realise you need to use GROUP BY:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code;

+-----+---+
| CWC | 1 |
| KCL | 2 |
| UCL | 2 |
+-----+---+


Observe that because we use an INNER JOIN, WLC is not in the list. How do we correct this? That's right, you use a LEFT JOIN so all the colleges are listed.

So that's got us the correct result. Now, you think to yourself, what if he had asked me for a list of colleges with 2 or more students? So you consider using a condition in the WHERE clause:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
WHERE
num_students >= 2
GROUP BY
Colleges.code;


Result:
ERROR: Unknown column 'num_students' in 'where clause'.

OK. So we try again with COUNT(*) >= 2.

Result:
ERROR: Invalid use of group function.

So what is the correct solution?! You realise that you need to apply a condition that is related to the aggregated resultset that is returned by the query.
When you use aggregate functions like COUNT and GROUP BY, you would need to use "HAVING" like so:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code
HAVING
num_students >= 2;

+-----+---+
| KCL | 2 |
| UCL | 2 |
+-----+---+


Just what we wanted here!

Anyway, you get back to our original task - get a list of colleges and the number of students that attend them. You know how to answer it but you want to list the colleges ordered by num_students.
You use LEFT JOIN and ORDER BY to achieve this using the query:

SELECT
Colleges.code, COUNT(*) AS num_students
FROM
Colleges
LEFT JOIN
Students ON Colleges.code = Students.college_code
GROUP BY
Colleges.code
ORDER BY
num_students DESC, code ASC;


And this gives you the output that will impress the dean, impress all who know you and establish you as an SQL guru in the department.

+-----+---+
| KCL | 2 |
| UCL | 2 |
| CWC | 1 |
| WLC | 0 |
+-----+---+



No comments:

Post a Comment