Geeky is Awesome: Finding the nth row in every group in SQL

Let's say you have the following log table which stores the dates of each access:

TimeStamp	IP	UserName
2005-10-30 10:45:03	172.16.254.10	jlor
2005-10-30 10:46:31	172.16.254.12	kpar
2005-10-31 09:14:13	172.16.254.14	jlor
2005-10-31 09:25:42	172.16.254.16	kpar
2005-10-31 12:41:14	172.16.254.19	jlor
2005-11-01 07:15:15	172.16.254.20	kpar

You are asked to make a report of the last time each user has accessed the system using SQL.

At first you try using GROUP BY but then realize that it's not so simple to include the IP field along with the TimeStamp and UserName. GROUP BY works when you're interested in aggregating every field that is not used to group the records. In other words, you can easily do this:

SELECT UserName, MAX(TimeStamp)
FROM log
GROUP BY UserName

USERNAME	MAX(TIMESTAMP)
jlor	October, 31 2005 12:41:14+0000
kpar	November, 01 2005 07:15:15+0000

But if you also want to show the corresponding IP address of the access with the latest time stamp, you'd have a problem using simple SQL. If you add the IP field in the SELECT statement, you'd end up with the first IP in the table that belongs to the corresponding user, rather than the IP of the latest time stamp.

SELECT UserName, IP, MAX(TimeStamp)
FROM log
GROUP BY UserName

USERNAME	IP	MAX(TIMESTAMP)
jlor	172.16.254.10	October, 31 2005 12:41:14+0000
kpar	172.16.254.12	November, 01 2005 07:15:15+0000

The way to do this is to simulate the GROUP BY statement using more expressiveness methods.

MS SQL Server

In MS SQL Server, this is achieved using the ROW_NUMBER function. This function gives a number for each row (1, 2, 3, ...) which can be used inside a SELECT statement. The cool thing about this function is that the numbering can be made to restart for every different value in a field. So if we used it on the UserName field we'd have the following:

SELECT UserName, IP, TimeStamp, ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY TimeStamp DESC)
FROM log

USERNAME	IP	TIMESTAMP	COLUMN_3
jlor	172.16.254.10	October, 30 2005 10:45:03+0000	1
jlor	172.16.254.14	October, 31 2005 09:14:13+0000	2
jlor	172.16.254.19	October, 31 2005 12:41:14+0000	3
kpar	172.16.254.12	October, 30 2005 10:46:31+0000	1
kpar	172.16.254.16	October, 31 2005 09:25:42+0000	2
kpar	172.16.254.20	November, 01 2005 07:15:15+0000	3

It even orders the rows by user name and it lets you say how you want the rows of each user to be ordered so that you can say how you want the numbering. Using the SQL above, the row with the latest time stamp of each user has a 1 in the last column. This allows us to select it. It will have to be inside a nested query however in order to be used in a WHERE statement.

SELECT UserName, IP, TimeStamp
FROM (
  SELECT UserName, IP, TimeStamp, ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY TimeStamp DESC) AS rank
  FROM log
) AS t
WHERE rank = 1

USERNAME	IP	TIMESTAMP	COLUMN_3
jlor	172.16.254.19	October, 31 2005 12:41:14+0000
kpar	172.16.254.20	November, 01 2005 07:15:15+0000

Notice that you can even find when the second to last time an access was made by changing the 1 in the WHERE statement to a 2.

You can experiment with this in this SQL Fiddle.

MySQL

Unfortunately MySQL doesn't have a function as nifty as ROW_NUMBER so instead we'll have to simulate that using variables. In MySQL you can create variables using the SET statement and then update them within a SELECT statement so that they change for each row, like this:

SET @row_number := 0;
SELECT UserName, IP, TimeStamp, @row_number := @row_number + 1
FROM log

USERNAME	IP	TIMESTAMP	@ROW_NUMBER := @ROW_NUMBER + 1
jlor	172.16.254.10	October, 30 2005 10:45:03+0000	1
kpar	172.16.254.12	October, 30 2005 10:46:31+0000	2
jlor	172.16.254.14	October, 31 2005 09:14:13+0000	3
kpar	172.16.254.16	October, 31 2005 09:25:42+0000	4
jlor	172.16.254.19	October, 31 2005 12:41:14+0000	5
kpar	172.16.254.20	November, 01 2005 07:15:15+0000	6

This is only half the story of course. We want the numbering to restart for every user and we also want this to happen after sorting the rows by user name. We also want the rows belonging to each user to be sorted by time stamp. A simple ORDER BY statement can handle the sorting part:

SET @row_number := 0;
SELECT UserName, IP, TimeStamp, @row_number := @row_number + 1
FROM log
ORDER BY UserName, TimeStamp DESC

USERNAME	IP	TIMESTAMP	@ROW_NUMBER := @ROW_NUMBER + 1
jlor	172.16.254.19	October, 31 2005 12:41:14+0000	1
jlor	172.16.254.14	October, 31 2005 09:14:13+0000	2
jlor	172.16.254.10	October, 30 2005 10:45:03+0000	3
kpar	172.16.254.20	November, 01 2005 07:15:15+0000	4
kpar	172.16.254.16	October, 31 2005 09:25:42+0000	5
kpar	172.16.254.12	October, 30 2005 10:46:31+0000	6

The restarting of numbering is a little less simple. We have to keep track of what the previous value was using another variable and we have to also choose between setting row_number to 1 or to increment it by 1. Here is the code:

SET @row_number := 0;
SET @prev_username := NULL;
SELECT UserName, IP, TimeStamp, @row_number := CASE WHEN UserName = @prev_username THEN @row_number + 1 ELSE 1 END, @prev_username := UserName
FROM log
ORDER BY UserName, TimeStamp DESC

USERNAME	IP	TIMESTAMP	@ROW_NUMBER := CASE WHEN USERNAME = @PREV_USERNAME THEN @ROW_NUMBER + 1 ELSE 1 END	@PREV_USERNAME := USERNAME
jlor	172.16.254.19	October, 31 2005 12:41:14+0000	1	jlor
jlor	172.16.254.14	October, 31 2005 09:14:13+0000	2	jlor
jlor	172.16.254.10	October, 30 2005 10:45:03+0000	3	jlor
kpar	172.16.254.20	November, 01 2005 07:15:15+0000	1	kpar
kpar	172.16.254.16	October, 31 2005 09:25:42+0000	2	kpar
kpar	172.16.254.12	October, 30 2005 10:46:31+0000	3	kpar

The CASE statement selects a value to set row_number. If the current row's user name is the same as the previous one's then the value will be one more than row_number it currently is. Otherwise it is set to 1. After that variable is set, the prev_username variable is set to the current row's user name.

Finally we can now use this to select the latest access for each user.

SET @row_number := 0;
SET @prev_username := NULL;
SELECT UserName, IP, TimeStamp
FROM (
  SELECT UserName, IP, TimeStamp, @row_number := CASE WHEN UserName = @prev_username THEN @row_number + 1 ELSE 1 END AS rank, @prev_username := UserName
  FROM log
  ORDER BY UserName, TimeStamp DESC
) AS t
WHERE rank = 1

USERNAME	IP	TIMESTAMP
jlor	172.16.254.19	October, 31 2005 12:41:14+0000
kpar	172.16.254.20	November, 01 2005 07:15:15+0000

Notice that you can even find when the second to last time an access was made by changing the 1 in the WHERE statement to a 2.

You can experiment with this in this SQL Fiddle.

Geeky is Awesome

Saturday, February 7, 2015

Finding the nth row in every group in SQL

No comments:

Post a Comment