GROUP BY and aggregate functions

Imagine you have the following table ‘users’:

id username level points
1 jsmith 10 1000
2 fbloggs 10 2000
2 jdoe 3 100
4 mbloggs 3 1000
5 bdavis 8 500

You can do some pretty interesting stuff with this data. You could determine the results of a contest for who can acquire the most points. Let’s try that.

Whoa, hang on a minute. That’s not right. What just happened here?

MAX() is an aggregate function. What does that mean? It means its job is to take in many rows, perform an operation to combine all the data, and return a single result. We got our single result. But what happened to username? It got aggregated. We asked for two pieces of information that didn’t necessarily have anything to do with each other.

It’s worth pointing out that Postgres would not let you make this mistake.

Let’s use Postgres for now to be safe.

jsmith and mbloggs are tied for second place. Listing users that are tied on the same line, separated by commas would be a better way to see this.

in Postgres or in MySQL,

Can we simplify the SQL at all? What if we want just the user with the most points? What about this:

The username column is now in an aggregate function too, so Postgres shouldn’t complain.

Wow. We’ve managed to shoot ourselves in the foot even with Postgres. Since this is the second time we’ve made a mistake like this, let’s take a moment to dig a bit deeper about why this doesn’t work the way we might think it should in this particular situation.

What is MAX()? It returns the maximum value for a particular column given a group of rows. It’s a piece of statistical information about a GROUP of rows. You don’t need statistics about a single row.

For instance, it’s more apparent that it wouldn’t make sense to do this:

The number of rows doesn’t really have much to do with any of the ids.

Picking a Winner Correctly

So how would you accomplish this? Like this:

-or-

Notice that we use one query (or subquery) to find the highest point value, and a second query to find all users (there could be a tie) with that value. Aggregates aren’t for working with single rows.

Using Aggregate Functions Correctly

So what would be an appropriate way to use aggregate functions and GROUP BY? What if we wanted to look at the amount of participation of the various user levels?

Here we’re dealing with groups of rows that have something in common: level. We group all users of the same level into a single pseudo-row. Note that, just as it doesn’t make sense to use aggregate functions on a single row, it doesn’t make sense to talk about a single column of a group of rows. Users of level 10 don’t have a single points value, they have many. How do you want them aggregated? Do you want to add them up? Concatenate them into a list? Find the highest?

The one sort of gray area/exception in this case is level 8. Because there is only one row in the group, there is technically only one points value, only one id, etc. It might be helpful to think of these cases not as a single row, but as a set or array containing one row. [5] instead of 5.

About Alexander

Alexander has been programming for a very long time. When he met Ruby, it was love at first sight.
This entry was posted in Uncategorized and tagged , , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *