Tuesday, April 6, 2010

Why to And How to use Denormalization

Please read this post first if If you have come directly on this post.

So now as we know the types of Denormalization let's discuss why to and how to use them.

First of all - why to use Denormalization?

For the ideologists Denormalization would be a bad approach, but in practice you would have to use it in one or another way.

For three reasons

1). Performance

This would be the biggest reason to use denormalization. Denormalization significantly reduces the query run time by reducing the number of joins. Understand that each join statement does multiplication of the number of rows of each joining tables. Imagine a join statement between 3 tables that contain millions of records each!! How cool if you can avoid refering at least one of the two tables from the master table? Just have the required value lieing in table 3 in the master table itself and you are saved from referring to that table!! However a great amount of wisdom is required to do this. More to come on this later.

2). Database readability

Not as big reason as performance but sometimes you don't want your database to be very complex, specially in smaller projects. I've seen "Half Technical" clients wanting to be able to access the database without much exercise.

3). Ease of coding

You are saved from writing huge joins for doing simple operations with your denormalized database. Actually I don't know why I'm writing this. Who cares for poor programmers in this cruel world!! They are supposed to be supermen with no feeling!!

Now I will explain "How" to use the denormalization.

So from where to start? The first thing you need to know is "Do not create a denormalized database directly". To do denormalization you should always create a conventional normalized database first. This will give you clear idea of what type and to what extent you actually need denormalization. Remember, Denormalization is not simply opposite to Normalization but it's the next step to the normalization.

1). Reference Centralization Method

As I mentioned in my earlier post sometimes you see some really unnecessary normalization done on database. I would recommend to wisely avoid that. As a DB admin you have to take wise decision between the performance and maintainability. There is no harm to do denormalization for performance if a simple update statement can solve your maintainability problem. More often than not it is wise to not to use over normalized database. I've seen databases which had separate tables even for phone numbers and fax numbers!! Upon asking they gave me a very unconvincing reason but what it did was very painful for me. I had to write huge joins for very simple operations. Not required to mentioned that it slowed down that huge database a great deal.

2). CSV Makeshift Denormalization

This style is recommended if you are not going to need to do join using the reference values. The decision should be visionary otherwise it may cause stay backs for you in office and "enjoy" the company sponsored food!!

3). Scattered Denormalization

Think of Google Buzz here. They list the post description along with the user's name. Think how ofthen they have to do joins on post and user table considering all the conversation threads are real time Ajax driven. And also think how big the user table and post table would be. Now think how cool it would be if they had put the user's name along with the user's ID in the post table itself. They would not have to join to the user table at all!! This might look cool but this has to be a very wise decision of an experienced man. In this case Buzz has to update all the post records whenever user updates his name in the profile. I'm sure Google would not have done this. I would do this only if I have to run the said query VERY VERY often in the system and my hardware resources are very very poor in comparison to Google's.

Types of Denormalization of database

We all know what is normalization of the database and many of you know what is denormalization as well. What this post is meant for is to define the methods of the denormalization.

There is very little theoretical info available on the web on denormalization despite it being widely used. In fact every programmer would have used denormalization in one way or other in many projects. So since I could not find any theoretical explaination of it I would try to define the types and their prescribed useage myself.

So children, let's start the "Denormalization Class".

There are three types of denormalizations (Please forgive my poor vocabulary if you find the names of the types not much convincing. Suggestions are always welcomed!).

1). Reference Centralization Method

I bet every programmer would have used this method in each database driven project that they have worked on. In this method you simply avoid unnecessary normalization on your database. An example would explain this better.

In a User table you would want to have references to the address and phone numbers. Normally you would simply add address and phone fields in the user table. Believe it or not this is "Reference Centralization Method". If you wanted to normalize it you would need to have address table reference ID in the user table, then phone number ID in the Addresses table to reference the Phones table and so on. So in short, when you decide to not to have an over normalized table - and have some fixed fields instead - you are using Reference Centralization Method.

One can argue that it cannot be termed "Denormalization" just because it's not normalization. But hey, I am here to classify the types of denormalization and I can clearly see this as one of the types. So what if it's so commonly used? I cannot ignore it.

See this link to know when to use it.

2). CSV Makeshift Denormalization

Sometimes you prefer to not to create an association table to link two tables, you put CSV values in one of the master tables just to keep it simple. This method is CSV Makeshift Denormalization.

See this link to know when to use it.

3). Scattered denormalization

Hmm. This one is interesting. This is completely opposite to the ideology of normalization. In this you store a single set of information in multiple places. For example instead of having just User_ID in the Blog_Description table you would store User_Name, User_Rating along with User_ID in parallel to the User table. Thus you have to update the Blog_Description table each time you update the User table. It may look weird but it's actually useful in some cases.

See this link to know when to use it.

Anymore type you can think of? Please suggest.

To keep the post smaller I will write another post to prescribe when to and how to use different types of Denormalizations.