Tuesday, April 6, 2010

Why to And How to use Denormalization

Please read this post first if If you have come directly on this post.

So now as we know the types of Denormalization let's discuss why to and how to use them.

First of all - why to use Denormalization?

For the ideologists Denormalization would be a bad approach, but in practice you would have to use it in one or another way.

For three reasons

1). Performance

This would be the biggest reason to use denormalization. Denormalization significantly reduces the query run time by reducing the number of joins. Understand that each join statement does multiplication of the number of rows of each joining tables. Imagine a join statement between 3 tables that contain millions of records each!! How cool if you can avoid refering at least one of the two tables from the master table? Just have the required value lieing in table 3 in the master table itself and you are saved from referring to that table!! However a great amount of wisdom is required to do this. More to come on this later.

2). Database readability

Not as big reason as performance but sometimes you don't want your database to be very complex, specially in smaller projects. I've seen "Half Technical" clients wanting to be able to access the database without much exercise.

3). Ease of coding

You are saved from writing huge joins for doing simple operations with your denormalized database. Actually I don't know why I'm writing this. Who cares for poor programmers in this cruel world!! They are supposed to be supermen with no feeling!!

Now I will explain "How" to use the denormalization.

So from where to start? The first thing you need to know is "Do not create a denormalized database directly". To do denormalization you should always create a conventional normalized database first. This will give you clear idea of what type and to what extent you actually need denormalization. Remember, Denormalization is not simply opposite to Normalization but it's the next step to the normalization.

1). Reference Centralization Method

As I mentioned in my earlier post sometimes you see some really unnecessary normalization done on database. I would recommend to wisely avoid that. As a DB admin you have to take wise decision between the performance and maintainability. There is no harm to do denormalization for performance if a simple update statement can solve your maintainability problem. More often than not it is wise to not to use over normalized database. I've seen databases which had separate tables even for phone numbers and fax numbers!! Upon asking they gave me a very unconvincing reason but what it did was very painful for me. I had to write huge joins for very simple operations. Not required to mentioned that it slowed down that huge database a great deal.

2). CSV Makeshift Denormalization

This style is recommended if you are not going to need to do join using the reference values. The decision should be visionary otherwise it may cause stay backs for you in office and "enjoy" the company sponsored food!!

3). Scattered Denormalization

Think of Google Buzz here. They list the post description along with the user's name. Think how ofthen they have to do joins on post and user table considering all the conversation threads are real time Ajax driven. And also think how big the user table and post table would be. Now think how cool it would be if they had put the user's name along with the user's ID in the post table itself. They would not have to join to the user table at all!! This might look cool but this has to be a very wise decision of an experienced man. In this case Buzz has to update all the post records whenever user updates his name in the profile. I'm sure Google would not have done this. I would do this only if I have to run the said query VERY VERY often in the system and my hardware resources are very very poor in comparison to Google's.

No comments:

Post a Comment