GROUP BY and aggregate functions
Imagine you have the following table ‘users’:
id | username | level | points |
---|---|---|---|
1 | jsmith | 10 | 1000 |
2 | fbloggs | 10 | 2000 |
2 | jdoe | 3 | 100 |
4 | mbloggs | 3 | 1000 |
5 | bdavis | 8 | 500 |
You can do some pretty interesting stuff with this data. You could determine the results of a contest for who can acquire the most points. Let’s try that.
SELECT username, MAX(points) FROM users;
+----------+-------------+ | username | MAX(points) | +----------+-------------+ | jsmith | 2000 | +----------+-------------+
Whoa, hang on a minute. That’s not right. What just happened here?
MAX()
is an aggregate function. What does that mean? It means its job is to take in many rows, perform an operation to combine all the data, and return a single result. We got our single result. But what happened to username? It got aggregated. We asked for two pieces of information that didn’t necessarily have anything to do with each other.
It’s worth pointing out that Postgres would not let you make this mistake.
ERROR: column "users.username" must appear in the GROUP BY clause or be used in an aggregate function.
Let’s use Postgres for now to be safe.
SELECT username, points FROM users ORDER BY points DESC;
username | points ----------+-------- fbloggs | 2000 jsmith | 1000 mbloggs | 1000 bdavis | 500 jdoe | 100
jsmith and mbloggs are tied for second place. Listing users that are tied on the same line, separated by commas would be a better way to see this.
SELECT ARRAY_AGG(username), points FROM users GROUP BY points ORDER BY points DESC;
in Postgres or in MySQL,
SELECT GROUP_CONCAT(username), points FROM users GROUP BY points ORDER BY points DESC;
array_agg | points ------------------+-------- {fbloggs} | 2000 {jsmith,mbloggs} | 1000 {bdavis} | 500 {jdoe} | 100
Can we simplify the SQL at all? What if we want just the user with the most points? What about this:
SELECT ARRAY_AGG(username), MAX(points) from users;
The username column is now in an aggregate function too, so Postgres shouldn’t complain.
array_agg | max --------------------------------------+------ {jsmith,fbloggs,jdoe,mbloggs,bdavis} | 2000
Wow. We’ve managed to shoot ourselves in the foot even with Postgres. Since this is the second time we’ve made a mistake like this, let’s take a moment to dig a bit deeper about why this doesn’t work the way we might think it should in this particular situation.
What is MAX()
? It returns the maximum value for a particular column given a group of rows. It’s a piece of statistical information about a GROUP of rows. You don’t need statistics about a single row.
For instance, it’s more apparent that it wouldn’t make sense to do this:
SELECT id, COUNT(*) FROM users;
The number of rows doesn’t really have much to do with any of the ids.
Picking a Winner Correctly
So how would you accomplish this? Like this:
SELECT MAX(points) FROM users;
max ------ 2000 (1 row)
SELECT username, points FROM users WHERE points=2000;
-or-
SELECT username, points FROM users WHERE points = (SELECT MAX(points) FROM users);
Notice that we use one query (or subquery) to find the highest point value, and a second query to find all users (there could be a tie) with that value. Aggregates aren’t for working with single rows.
Using Aggregate Functions Correctly
So what would be an appropriate way to use aggregate functions and GROUP BY
? What if we wanted to look at the amount of participation of the various user levels?
SELECT level, SUM(points), MIN(points), MAX(points) FROM users GROUP BY level;
level | sum | min | max -------+------+------+------ 8 | 500 | 500 | 500 3 | 1100 | 100 | 1000 10 | 3000 | 1000 | 2000
Here we’re dealing with groups of rows that have something in common: level. We group all users of the same level into a single pseudo-row. Note that, just as it doesn’t make sense to use aggregate functions on a single row, it doesn’t make sense to talk about a single column of a group of rows. Users of level 10 don’t have a single points value, they have many. How do you want them aggregated? Do you want to add them up? Concatenate them into a list? Find the highest?
The one sort of gray area/exception in this case is level 8. Because there is only one row in the group, there is technically only one points value, only one id, etc. It might be helpful to think of these cases not as a single row, but as a set or array containing one row. [5] instead of 5.