This is a simple tutorial aimed towards those who, like me, completely misunderstood how to, and why to, use joins in SQL. It is not meant to be an advanced reference but a tutorial for beginners who know some SQL but can't understand joins.
Huh?
Let's say that you have a database with tables which are related by foreign keys. For example, a Customers table, a Products table and an Orders table. Each row in the Orders table is linked to a row in the Customers table (the customer who made the order) and to a row in the Products table (the product which was ordered).
Now let's say that you want to query the database to view all the rows in the Orders table but with the customer's name and product's name instead of their primary key. If you learned SQL the same way I did, you'd probably do something like this:
SELECT orders.date, customers.name, products.name
FROM orders, customers, products
WHERE orders.customer = customers.id AND orders.product = products.id;
This is called an implicit join. In fact you are doing an unconditioned join between all 3 tables and then just filtering out the joined rows which are not made of related table rows. What does this mean?
If our 3 tables contain the following data:
Orders:
Id | Date | Customer | Product |
1 | 01/01/01 | 1 | 2 |
2 | 2/02/02 | 3 | 2 |
3 | 03/03/03 | 3 | 3 |
Customers:
Id | Name |
1 | John |
2 | Michael |
3 | Terry |
Products:
Id | Name |
1 | Cheese |
2 | Parrot |
3 | Spam |
Then the query will cause this to happen:
Orders.Date | Customers.Name | Products.Name |
01/01/01 | John | Cheese |
01/01/01 | John | Parrot |
01/01/01 | John | Spam |
01/01/01 | Michael | Cheese |
01/01/01 | Michael | Parrot |
01/01/01 | Michael | Spam |
01/01/01 | Terry | Cheese |
01/01/01 | Terry | Parrot |
01/01/01 | Terry | Spam |
02/02/02 | John | Cheese |
02/02/02 | John | Parrot |
02/02/02 | John | Spam |
02/02/02 | Michael | Cheese |
02/02/02 | Michael | Parrot |
02/02/02 | Michael | Spam |
02/02/02 | Terry | Cheese |
02/02/02 | Terry | Parrot |
02/02/02 | Terry | Spam |
03/03/03 | John | Cheese |
03/03/03 | John | Parrot |
03/03/03 | John | Spam |
03/03/03 | Michael | Cheese |
03/03/03 | Michael | Parrot |
03/03/03 | Michael | Spam |
03/03/03 | Terry | Cheese |
03/03/03 | Terry | Parrot |
03/03/03 | Terry | Spam |
And after generating all that it will then select the rows which make sense relation wise, according to the WHERE statement.
Orders.Date | Customers.Name | Products.Name |
01/01/01 | John | Parrot |
02/02/02 | Terry | Parrot |
03/03/03 | Terry | Spam |
The big table is called a Cartesian product, which means that all possible combinations of rows between the tables are created, yielding a number of rows equal to the multiplication of the number of rows in each table (3 * 3 * 3 = 27 in this case). As you can tell, this will get really huge really quickly as the table get bigger.
Enter explicit joins.
Joins
In SQL, the FROM statement doesn't mean "a list of tables used in the SELECT statement". In the FROM statement you mention just one table. This table could either be one of the tables you created in the database or a "custom" table, such as one which is a joined version of several tables. The JOIN statement in SQL is not a keyword on par with the FROM statement, as I used to think due to the way it is indented in most tutorials I've seen. It is in fact a binary operator between 2 tables, just like "+" and "=" as binary operators.
The JOIN operator takes 2 tables and returns a new table, which is a joined version of the 2 tables. You can then join another table to that new table and keep on adding more tables to the "composite" table. There are several joins available in standard SQL which will be described later, but if we should use the INNER JOIN type in an example, this is how joins are used in general:
SELECT orders.date, customers.name, products.name
FROM (orders INNER JOIN customers ON orders.customer = customers.id) INNER JOIN products ON orders.product = products.id;
As you can see, the join is used between tables and can even be bracketed in order to state which tables should be joined up first. The ON statement is used to immediately join up the rows which are related, avoiding the cartesian product table in the first place. This will therefore be processed much more efficiently as the third table will only be joined up after the first two tables have been joined correctly, rather than joining everything together without a condition.
In the implicit join example, the tables where joined up with an inner join without the ON condition, hence losing all the advantage of joins. Now that you know how joins work, let's see what are the differences between the 4 join types described in w3schools.com.
Join types
The main difference between the join types is how they handle rows which don't match the ON condition.
INNER JOIN:
Strictly follows the ON condition between tables such that any rows in one table which cannot be matched up with another row in the other table will be left out. This is the kind of join we are used to.
SELECT orders.date, customers.name, products.name
FROM (orders INNER JOIN customers ON orders.customer = customers.id) INNER JOIN products ON orders.product = products.id;
results in
Orders.Date | Customers.Name | Products.Name |
01/01/01 | John | Parrot |
02/02/02 | Terry | Parrot |
03/03/03 | Terry | Spam |
As you can see, not all customers or all products where mentioned. Only the ones which were mentioned in the ORDERS table and hence were related in some way were returned.
OUTER JOIN:
This is where things get a bit hairy and hence require that you make an effort to understand. The outer join makes sure that all rows in both tables are returned, even if there is no matching row in the other table. What happens to those tables which have no matching rows? They are joined to NULL values. As long as they are returned, it doesn't matter what they're joined to right?
You can also use LEFT JOIN and RIGHT JOIN to say whether you want to return all the rows of the table on the left of the join operator only or all the rows of the table on the right of the join operator only.
Now in order to use these joins to create a complex joined table, you have to be sure that a join will not return columns with NULL values which will be used in an ON condition of another join.
SELECT orders.date, customers.name
FROM orders RIGHT JOIN customers ON orders.customer = customers.id;
results in
Orders.Date | Customers.Name |
01/01/01 | John |
NULL | Michael |
02/02/02 | Terry |
03/03/03 | Terry |
SELECT orders.date, products.name
FROM orders RIGHT JOIN products ON orders.product = products.id;
results in
Orders.Date | Customers.Name |
NULL | cheese |
02/02/02 | parrot |
01/01/01 | parrot |
03/03/03 | spam |
SELECT orders.date, products.name, customers.name
FROM (orders LEFT JOIN customers ON orders.customer = customers.id) RIGHT JOIN products ON orders.product = products.id;
results in
Orders.Date | Customers.Name | Products.Name |
NULL | NULL | cheese |
02/02/02 | terry | parrot |
01/01/01 | john | parrot |
03/03/03 | terry | spam |
As you can see, depending on how the outer joins are used, we can make a particular table return all its rows, even if it isn't related to any other table.
Final note
I'd like to leave a few notes and links before concluding.
First of all, what I said about the implicit join not being efficient next to an explicit one is not necessarily true since the SQL optimizer may fix it before executing it.
Secondly, if you can avoid using brackets in the the joins it would be better as that will allow the SQL optimizer to make adjustments to the query without it having to make sure to preserve the precedence which you unneccessarily imposed.
In a future post, I will describe a trick which I used to efficiently view multiple one-to-many related tables which I also had trouble with in SQL until recently. But for now I will leave you with these links:
http://www.w3schools.com/sql/sql_join.asp
http://onlamp.com/pub/a/onlamp/2004/09/30/from_clauses.html
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html