Wednesday, March 30, 2011

Database Normalization (1-3 NF)

This is a tutorial for those who are confused about the normal forms due to the extreme confusion you find on the web about the subject. If you want to know what normalization is and why to do it, wikipedia has a great article detailing this information:
http://en.wikipedia.org/wiki/Database_normalization

1NF
1NF is arguably the most ambiguous and confusing normal form on the web. The first normal form is just about making multi-valued fields organized into multiple rows. There are two types of multi-valued fields in unnormalized tables:

This-

IdStudentSubject 1Subject 2
1HarryCharmsPotions
2RonCharmsPotions

And this-

IdStudentSubjects
1HarryCharms, Potions
2RonCharms, Potions

Or as is shown in most examples, this-

IdStudentSubject
1HarryCharms
Potions
2RonCharms
Potions

In both cases we are tying to shove in a list of values, that is a one-to-many or many-to-many relationship with the row, into the row itself. For reasons detailed in the wikipedia article, this should be fixed by creating a separate row for each value and repeating the values in the others fields, like so:

IdStudentSubject
1HarryCharms
1HarryPotions
2RonCharms
2RonPotions

The table is now in 1NF were it not for the primary key losing its uniqueness. The id field must be unique in order to identify each distinct student. To solve this we can do one of two things:

Either we create an additional key field for the subjects and make the primary key of the whole table be a composite key of both student and subject ids, thus making the composite key unique-

StudIdStudentSubjIdSubject
1Harry1Charms
1Harry2Potions
2Ron1Charms
2Ron2Potions

Or we could just prepare for the other normal forms and start splitting the tables now as is usually done in examples-

StudIdStudentSubjId
1Harry1
1Harry2
2Ron1
2Ron2

SubjIdSubject
1Charms
2Potions

Notice that we still need to make the foreign key in the students table part of the primary key in order to preserve uniqueness. If you're wondering why we didn't use a weak entity (bridge table / junction table), that's because it's done in 2NF.

If we had more than one multi-valued field, we'd just create a Cartesian product, like so:

StudIdStudentSubjIdSubjectTeacIdTeacher
1Harry1Charms1Filius
1Harry2Potions2Slughorn
1Harry2Potions3Snape
2Ron1Charms1Filius
2Ron2Potions2Slughorn
2Ron2Potions3Snape

Resulting in 3 tables:

StudIdStudentSubjIdTeacId
1Harry11
1Harry22
1Harry23
2Ron11
2Ron22
2Ron23

SubjIdSubject
1Charms
2Potions

TeacIdTeacher
1Filius
2Slughorn
3Snape

Complete example:

IdStudentSubjIdSubjectRoomRoomHallTeacIdTeacher
1Harry1Charms101A1Filius
2Potions202B2Slughorn
3Snape
2Ron1Charms101A1Filius
2Potions202B2Slughorn
3Snape

Turns into:

StudIdStudentSubjIdSubjectRoomRoomHallTeacIdTeacher
1Harry1Charms101A1Filius
1Harry2Potions202B2Slughorn
1Harry2Potions202B3Snape
2Ron1Charms101A1Filius
2Ron2Potions202B2Slughorn
2Ron2Potions202B3Snape

Notice that the room is assumed to be dependant on the subject such that each subject is taught in its own room.

We may opt to leave the table as it is as it quite complex to break down into smaller tables without any guidance. However if we are to break the table down, the room information would be included in with the subject table since we said that the room is dependant on it, yielding the following:

StudIdStudentSubjIdTeacId
1Harry11
1Harry22
1Harry23
2Ron11
2Ron22
2Ron23

SubjIdSubjectRoomRoomHall
1Charms101A
2Potions202B

TeacIdTeacher
1Filius
2Slughorn
3Snape

2NF
2NF is only applicable on tables with composite keys. If a table does not have a composite key, then it is already in 2NF. To make a table in 2NF, first you make sure it is in 1NF and then you split it into separate tables depending on which part of the composite key the fields depend on. For example in the students' table above, the student name does not depend on the subject id, it depends only on part of the composite key, that is, the student id. So the student name field should go into a separate table which describes the students (together with the primary key of course).

StudIdSubjId
11
12
21
22

StudIdStudent
1Harry
2Ron

SubjIdSubject
1Charms
2Potions

And there is it, the weak entity we are so used to in many to many relationships.

Complete example (following from previous):

StudIdStudentSubjIdTeacId
1Harry11
1Harry22
1Harry23
2Ron11
2Ron22
2Ron23

Turns into:

StudIdSubjIdTeacId
111
122
123
211
222
223

StudIdStudent
1Harry
2Ron

Hence yielding:

StudIdSubjIdTeacId
111
122
123
211
222
223

StudIdStudent
1Harry
2Ron

SubjIdSubjectRoomRoomHall
1Charms101A
2Potions202B

TeacIdTeacher
1Filius
2Slughorn
3Snape

Notice that if in 1NF we did not break down the table, we'd result with the same set of tables by now.

3NF
3NF is the normal form we are used to. All we do is check that every field in a 2NF table depends directly on the primary key. If it doesn't or if it depends on a non-primary key field, you place it in its own table. For example if we had the following table:

SubjIdSubjectRoomRoomHall
1Charms101A
2Potions202B

The RoomHall field is directly dependent on the Room field and not on the SubjId primary key field, so the RoomHall field should go into a table on its own together with the Room field. In fact the room where the subject is thought is not a direct property of the subjects entity but is an entity on its own and hence should be separated into a rooms entity and only referenced by a foreign key.

SubjIdSubjectRoom
1Charms101
2Potions202

RoomRoomHall
101A
202B

Complete example (following from previous):

SubjId Subject Room RoomHall
1 Charms 101 A
2 Potions 202 B

Turns into:

SubjId Subject Room
1 Charms 101
2 Potions 202

Room RoomHall
101 A
202 B

Hence yielding:

StudId SubjId TeacId
1 1 1
1 2 2
1 2 3
2 1 1
2 2 2
2 2 3

StudId Student
1 Harry
2 Ron

SubjId Subject Room
1 Charms 101
2 Potions 202

Room RoomHall
101 A
202 B

TeacId Teacher
1 Filius
2 Slughorn
3 Snape

Links
http://portal.dfpug.de/dFPUG/Dokumente/Partner/Hentzenwerke/Visual%20FoxPro%20Certification%20Exam%20Study%20Guide%20Chapter%2002.pdf

Monday, March 14, 2011

Reading Related Tables As Nested Rows

Problem
A problem I used to face with relational databases is how to handle 1-to-many or many-to-many rows in SQL select statements. Let's say you have 2 tables with a many-to-many relationship between them, such as a Movies table and an Actors table. How do I present a list of movies together with their associated actors?

The naive way to do this is by using nested loops with a query for movies in the top loop and a query for actors in the second loop. In PHP it would look something like...

$moviesResult = mysql_query("SELECT id, title FROM movies"); 
while($moviesRow = mysql_fetch_assoc($result)) 
{ 
    echo("<h1>" . $moviesRow['title'] . "</h1>"); 

    echo("<ul>"); 
    $actorsResult = mysql_query("SELECT actors.name FROM actors INNER JOIN movies_actors ON actors.id = movies_actors.actorid WHERE movies_actors.movieid = " . $moviesRow['id'] . ";"); 
    while($actorsRow = mysql_fetch_assoc($actorsResult)) 
    { 
        echo("<li>" . $actorsRow['name'] . "</li>"); 
    } 
    echo("</ul>"); 
}

This approach however will kill your database. Ideally you should minimize the number of queries sent.

Another approach would be to join the movies table with the actors table and then read it with nested loops.

$result = mysql_query("SELECT movies.id AS id movies.title AS title, actors.name AS actor FROM movies INNER JOIN movies_actors ON movies.id = movies_actors.movie INNER JOIN actors ON actors.id = movies_actors.actor"); 
$row = mysql_fetch_assoc($result); 
while($row) 
{ 
    $currMovieId = $row['id']; 
    echo("<h1>" . $row['title'] . "</h1>"); 
    echo("<ul>"); 
    do 
    { 
        echo("<li>" . $row['name'] . "</li>"); 
    } while ($row = mysql_fetch_assoc($result) && $row['id'] == $currMovieId);
    echo("</ul>"); 
}

This works but as soon as you add another table to the join, determining which rows contain new information can be a nightmare, not to mention all the rows which just contain repeated information due to the cartesian product. For example:

IdTitleActorGenre
1The MatrixKeanu ReevesAction
1The MatrixKeanu ReevesAdventure
1The MatrixKeanu ReevesSci-Fi
1The MatrixLaurence FishburneAction
1The MatrixLaurence FishburneAdventure
1The MatrixLaurence FishburneSci-Fi
1The MatrixCarrie-Anne MossAction
1The MatrixCarrie-Anne MossAdventure
1The MatrixCarrie-Anne MossSci-Fi
2The Matrix ReloadedKeanu ReevesAction
2The Matrix ReloadedKeanu ReevesAdventure
2The Matrix ReloadedKeanu ReevesSci-Fi
2The Matrix ReloadedLaurence FishburneAction
2The Matrix ReloadedLaurence FishburneAdventure
2The Matrix ReloadedLaurence FishburneSci-Fi
2The Matrix ReloadedCarrie-Anne MossAction
2The Matrix ReloadedCarrie-Anne MossAdventure
2The Matrix ReloadedCarrie-Anne MossSci-Fi
3The Matrix RevolutionsKeanu ReevesAction
3The Matrix RevolutionsKeanu ReevesAdventure
3The Matrix RevolutionsKeanu ReevesSci-Fi
3The Matrix RevolutionsLaurence FishburneAction
3The Matrix RevolutionsLaurence FishburneAdventure
3The Matrix RevolutionsLaurence FishburneSci-Fi
3The Matrix RevolutionsCarrie-Anne MossAction
3The Matrix RevolutionsCarrie-Anne MossAdventure
3The Matrix RevolutionsCarrie-Anne MossSci-Fi

Are we supposed to receive all those rows just to display the below?

1The MatrixKeanu Reeves, Laurence Fishburne, Carrie-Anne MossAction, Adventure, Sci-Fi
2The Matrix ReloadedKeanu Reeves, Laurence Fishburne, Carrie-Anne MossAction, Adventure, Sci-Fi
3The Matrix RevolutionsKeanu Reeves, Laurence Fishburne, Carrie-Anne MossAction, Adventure, Sci-Fi

Solution
The best solution I found is a hybrid of the above 2 solutions. You make a different query for each table and then read all the rows returned by each query, placing the information in the list it belongs to.

$moviesResult = mysql_query("SELECT id, title FROM movies;"); 
$actorsResult = mysql_query("SELECT movies_actors.movieid AS movieid, actors.name AS name FROM actors INNER JOIN movies_actors ON actors.id = movies_actors.actorid;");
$genresResult = mysql_query("SELECT movies_genres.movieid AS movieid, genres.name AS name FROM genres INNER JOIN movies_genres ON genres.id = movies_genres.genreid;"); 

$actorRow = mysql_fetch_assoc($actorsResult); 
$genreRow = mysql_fetch_assoc($genresResult); 
while($movieRow = mysql_fetch_assoc($moviesResult)) 
{ 
    $currMovieId = $movieRow['id']; 
    echo("<h1>" . $movieRow['title'] . "</h1>"); 

    echo("<ul>"); 
    while ($actorRow && $actorRow['movieid'] == $currMovieId) 
    { 
        echo("<li>" . $row['name'] . "</li>"); 
        $actorRow = mysql_fetch_assoc($actorsResult); 
    } 
    echo("</ul>"); 

    echo("<ul>"); 
    while ($genreRow && $genreRow['movieid'] == $currMovieId) 
    { 
        echo("<li>" . $row['name'] . "</li>"); 
        $genreRow = mysql_fetch_assoc($genresResult); 
    } 
    echo("</ul>"); 
}

In this way we only make 3 queries, equal to the number of tables, and we only read as many rows as needed.

Sunday, March 13, 2011

SQL Joins Tutorial

This is a simple tutorial aimed towards those who, like me, completely misunderstood how to, and why to, use joins in SQL. It is not meant to be an advanced reference but a tutorial for beginners who know some SQL but can't understand joins.

Huh?
Let's say that you have a database with tables which are related by foreign keys. For example, a Customers table, a Products table and an Orders table. Each row in the Orders table is linked to a row in the Customers table (the customer who made the order) and to a row in the Products table (the product which was ordered).

Now let's say that you want to query the database to view all the rows in the Orders table but with the customer's name and product's name instead of their primary key. If you learned SQL the same way I did, you'd probably do something like this:

SELECT orders.date, customers.name, products.name
FROM orders, customers, products
WHERE orders.customer = customers.id AND orders.product = products.id;

This is called an implicit join. In fact you are doing an unconditioned join between all 3 tables and then just filtering out the joined rows which are not made of related table rows. What does this mean?

If our 3 tables contain the following data:

Orders:
IdDateCustomerProduct
101/01/0112
22/02/0232
303/03/0333

Customers:
IdName
1John
2Michael
3Terry

Products:
IdName
1Cheese
2Parrot
3Spam

Then the query will cause this to happen:

Orders.DateCustomers.NameProducts.Name
01/01/01JohnCheese
01/01/01JohnParrot
01/01/01JohnSpam
01/01/01MichaelCheese
01/01/01MichaelParrot
01/01/01MichaelSpam
01/01/01TerryCheese
01/01/01TerryParrot
01/01/01TerrySpam
02/02/02JohnCheese
02/02/02JohnParrot
02/02/02JohnSpam
02/02/02MichaelCheese
02/02/02MichaelParrot
02/02/02MichaelSpam
02/02/02TerryCheese
02/02/02TerryParrot
02/02/02TerrySpam
03/03/03JohnCheese
03/03/03JohnParrot
03/03/03JohnSpam
03/03/03MichaelCheese
03/03/03MichaelParrot
03/03/03MichaelSpam
03/03/03TerryCheese
03/03/03TerryParrot
03/03/03TerrySpam

And after generating all that it will then select the rows which make sense relation wise, according to the WHERE statement.

Orders.DateCustomers.NameProducts.Name
01/01/01JohnParrot
02/02/02TerryParrot
03/03/03TerrySpam

The big table is called a Cartesian product, which means that all possible combinations of rows between the tables are created, yielding a number of rows equal to the multiplication of the number of rows in each table (3 * 3 * 3 = 27 in this case). As you can tell, this will get really huge really quickly as the table get bigger.

Enter explicit joins.

Joins
In SQL, the FROM statement doesn't mean "a list of tables used in the SELECT statement". In the FROM statement you mention just one table. This table could either be one of the tables you created in the database or a "custom" table, such as one which is a joined version of several tables. The JOIN statement in SQL is not a keyword on par with the FROM statement, as I used to think due to the way it is indented in most tutorials I've seen. It is in fact a binary operator between 2 tables, just like "+" and "=" as binary operators.

The JOIN operator takes 2 tables and returns a new table, which is a joined version of the 2 tables. You can then join another table to that new table and keep on adding more tables to the "composite" table. There are several joins available in standard SQL which will be described later, but if we should use the INNER JOIN type in an example, this is how joins are used in general:

SELECT orders.date, customers.name, products.name
FROM (orders INNER JOIN customers ON orders.customer = customers.id) INNER JOIN products ON orders.product = products.id;

As you can see, the join is used between tables and can even be bracketed in order to state which tables should be joined up first. The ON statement is used to immediately join up the rows which are related, avoiding the cartesian product table in the first place. This will therefore be processed much more efficiently as the third table will only be joined up after the first two tables have been joined correctly, rather than joining everything together without a condition.

In the implicit join example, the tables where joined up with an inner join without the ON condition, hence losing all the advantage of joins. Now that you know how joins work, let's see what are the differences between the 4 join types described in w3schools.com.

Join types
The main difference between the join types is how they handle rows which don't match the ON condition.

INNER JOIN:
Strictly follows the ON condition between tables such that any rows in one table which cannot be matched up with another row in the other table will be left out. This is the kind of join we are used to.

SELECT orders.date, customers.name, products.name
FROM (orders INNER JOIN customers ON orders.customer = customers.id) INNER JOIN products ON orders.product = products.id;

results in

Orders.DateCustomers.NameProducts.Name
01/01/01JohnParrot
02/02/02TerryParrot
03/03/03TerrySpam

As you can see, not all customers or all products where mentioned. Only the ones which were mentioned in the ORDERS table and hence were related in some way were returned.

OUTER JOIN:
This is where things get a bit hairy and hence require that you make an effort to understand. The outer join makes sure that all rows in both tables are returned, even if there is no matching row in the other table. What happens to those tables which have no matching rows? They are joined to NULL values. As long as they are returned, it doesn't matter what they're joined to right?

You can also use LEFT JOIN and RIGHT JOIN to say whether you want to return all the rows of the table on the left of the join operator only or all the rows of the table on the right of the join operator only.

Now in order to use these joins to create a complex joined table, you have to be sure that a join will not return columns with NULL values which will be used in an ON condition of another join.

SELECT orders.date, customers.name
FROM orders RIGHT JOIN customers ON orders.customer = customers.id;

results in

Orders.DateCustomers.Name
01/01/01John
NULLMichael
02/02/02Terry
03/03/03Terry

SELECT orders.date, products.name
FROM orders RIGHT JOIN products ON orders.product = products.id;

results in

Orders.DateCustomers.Name
NULLcheese
02/02/02parrot
01/01/01parrot
03/03/03spam

SELECT orders.date, products.name, customers.name
FROM (orders LEFT JOIN customers ON orders.customer = customers.id) RIGHT JOIN products ON orders.product = products.id;

results in

Orders.DateCustomers.NameProducts.Name
NULLNULLcheese
02/02/02terryparrot
01/01/01johnparrot
03/03/03terryspam

As you can see, depending on how the outer joins are used, we can make a particular table return all its rows, even if it isn't related to any other table.

Final note
I'd like to leave a few notes and links before concluding.

First of all, what I said about the implicit join not being efficient next to an explicit one is not necessarily true since the SQL optimizer may fix it before executing it.

Secondly, if you can avoid using brackets in the the joins it would be better as that will allow the SQL optimizer to make adjustments to the query without it having to make sure to preserve the precedence which you unneccessarily imposed.

In a future post, I will describe a trick which I used to efficiently view multiple one-to-many related tables which I also had trouble with in SQL until recently. But for now I will leave you with these links:

http://www.w3schools.com/sql/sql_join.asp
http://onlamp.com/pub/a/onlamp/2004/09/30/from_clauses.html
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html