Saturday, February 7, 2015

Finding the nth row in every group in SQL

Let's say you have the following log table which stores the dates of each access:

TimeStampIPUserName
2005-10-30 10:45:03172.16.254.10jlor
2005-10-30 10:46:31172.16.254.12kpar
2005-10-31 09:14:13172.16.254.14jlor
2005-10-31 09:25:42172.16.254.16kpar
2005-10-31 12:41:14172.16.254.19jlor
2005-11-01 07:15:15172.16.254.20kpar

You are asked to make a report of the last time each user has accessed the system using SQL.

At first you try using GROUP BY but then realize that it's not so simple to include the IP field along with the TimeStamp and UserName. GROUP BY works when you're interested in aggregating every field that is not used to group the records. In other words, you can easily do this:

SELECT UserName, MAX(TimeStamp)
FROM log
GROUP BY UserName

USERNAMEMAX(TIMESTAMP)
jlorOctober, 31 2005 12:41:14+0000
kparNovember, 01 2005 07:15:15+0000

But if you also want to show the corresponding IP address of the access with the latest time stamp, you'd have a problem using simple SQL. If you add the IP field in the SELECT statement, you'd end up with the first IP in the table that belongs to the corresponding user, rather than the IP of the latest time stamp.

SELECT UserName, IP, MAX(TimeStamp)
FROM log
GROUP BY UserName

USERNAMEIPMAX(TIMESTAMP)
jlor172.16.254.10October, 31 2005 12:41:14+0000
kpar172.16.254.12November, 01 2005 07:15:15+0000

The way to do this is to simulate the GROUP BY statement using more expressiveness methods.

MS SQL Server

In MS SQL Server, this is achieved using the ROW_NUMBER function. This function gives a number for each row (1, 2, 3, ...) which can be used inside a SELECT statement. The cool thing about this function is that the numbering can be made to restart for every different value in a field. So if we used it on the UserName field we'd have the following:

SELECT UserName, IP, TimeStamp, ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY TimeStamp DESC)
FROM log

USERNAMEIPTIMESTAMPCOLUMN_3
jlor172.16.254.10October, 30 2005 10:45:03+00001
jlor172.16.254.14October, 31 2005 09:14:13+00002
jlor172.16.254.19October, 31 2005 12:41:14+00003
kpar172.16.254.12October, 30 2005 10:46:31+00001
kpar172.16.254.16October, 31 2005 09:25:42+00002
kpar172.16.254.20November, 01 2005 07:15:15+00003

It even orders the rows by user name and it lets you say how you want the rows of each user to be ordered so that you can say how you want the numbering. Using the SQL above, the row with the latest time stamp of each user has a 1 in the last column. This allows us to select it. It will have to be inside a nested query however in order to be used in a WHERE statement.

SELECT UserName, IP, TimeStamp
FROM (
  SELECT UserName, IP, TimeStamp, ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY TimeStamp DESC) AS rank
  FROM log
) AS t
WHERE rank = 1

USERNAMEIPTIMESTAMPCOLUMN_3
jlor172.16.254.19October, 31 2005 12:41:14+0000
kpar172.16.254.20November, 01 2005 07:15:15+0000

Notice that you can even find when the second to last time an access was made by changing the 1 in the WHERE statement to a 2.

You can experiment with this in this SQL Fiddle.

MySQL
Unfortunately MySQL doesn't have a function as nifty as ROW_NUMBER so instead we'll have to simulate that using variables. In MySQL you can create variables using the SET statement and then update them within a SELECT statement so that they change for each row, like this:

SET @row_number := 0;
SELECT UserName, IP, TimeStamp, @row_number := @row_number + 1
FROM log

USERNAMEIPTIMESTAMP@ROW_NUMBER := @ROW_NUMBER + 1
jlor172.16.254.10October, 30 2005 10:45:03+00001
kpar172.16.254.12October, 30 2005 10:46:31+00002
jlor172.16.254.14October, 31 2005 09:14:13+00003
kpar172.16.254.16October, 31 2005 09:25:42+00004
jlor172.16.254.19October, 31 2005 12:41:14+00005
kpar172.16.254.20November, 01 2005 07:15:15+00006

This is only half the story of course. We want the numbering to restart for every user and we also want this to happen after sorting the rows by user name. We also want the rows belonging to each user to be sorted by time stamp. A simple ORDER BY statement can handle the sorting part:

SET @row_number := 0;
SELECT UserName, IP, TimeStamp, @row_number := @row_number + 1
FROM log
ORDER BY UserName, TimeStamp DESC

USERNAMEIPTIMESTAMP@ROW_NUMBER := @ROW_NUMBER + 1
jlor172.16.254.19October, 31 2005 12:41:14+00001
jlor172.16.254.14October, 31 2005 09:14:13+00002
jlor172.16.254.10October, 30 2005 10:45:03+00003
kpar172.16.254.20November, 01 2005 07:15:15+00004
kpar172.16.254.16October, 31 2005 09:25:42+00005
kpar172.16.254.12October, 30 2005 10:46:31+00006

The restarting of numbering is a little less simple. We have to keep track of what the previous value was using another variable and we have to also choose between setting row_number to 1 or to increment it by 1. Here is the code:

SET @row_number := 0;
SET @prev_username := NULL;
SELECT UserName, IP, TimeStamp, @row_number := CASE WHEN UserName = @prev_username THEN @row_number + 1 ELSE 1 END, @prev_username := UserName
FROM log
ORDER BY UserName, TimeStamp DESC

USERNAMEIPTIMESTAMP@ROW_NUMBER := CASE WHEN USERNAME = @PREV_USERNAME THEN @ROW_NUMBER + 1 ELSE 1 END@PREV_USERNAME := USERNAME
jlor172.16.254.19October, 31 2005 12:41:14+00001jlor
jlor172.16.254.14October, 31 2005 09:14:13+00002jlor
jlor172.16.254.10October, 30 2005 10:45:03+00003jlor
kpar172.16.254.20November, 01 2005 07:15:15+00001kpar
kpar172.16.254.16October, 31 2005 09:25:42+00002kpar
kpar172.16.254.12October, 30 2005 10:46:31+00003kpar

The CASE statement selects a value to set row_number. If the current row's user name is the same as the previous one's then the value will be one more than row_number it currently is. Otherwise it is set to 1. After that variable is set, the prev_username variable is set to the current row's user name.

Finally we can now use this to select the latest access for each user.

SET @row_number := 0;
SET @prev_username := NULL;
SELECT UserName, IP, TimeStamp
FROM (
  SELECT UserName, IP, TimeStamp, @row_number := CASE WHEN UserName = @prev_username THEN @row_number + 1 ELSE 1 END AS rank, @prev_username := UserName
  FROM log
  ORDER BY UserName, TimeStamp DESC
) AS t
WHERE rank = 1

USERNAMEIPTIMESTAMP
jlor172.16.254.19October, 31 2005 12:41:14+0000
kpar172.16.254.20November, 01 2005 07:15:15+0000

Notice that you can even find when the second to last time an access was made by changing the 1 in the WHERE statement to a 2.

You can experiment with this in this SQL Fiddle.