Wednesday, December 29, 2010

“Distinct vs Group By” SQL Talk

I am writing this blog entry to tell people about main differences between these two instructions because any misunderstanding to the true difference between them can cause big problems

That what happened to me actually
I didn’t not know the true difference and as always I have chosen the easy way and rely on Distinct instead of Group By

Let me show an example which I was working on
and that one really helped me understand the difference between these two

Suppose you have a Deal which is associated with several geographies
Each geography has a level representing its level in the tree of geographies. The higher the level, the lower the geography exists in the tree.

The objective was to get all deals in several geographies which I was given their IDS and sort them by depth/level from lowest to highest

I started writing my query to be like that
which is wrong by the way

select Distinct(deals.id) from deals INNER JOIN deals_geographies ON deals.id = deals_geographies.deal_id INNER JOIN geographies ON geographies.id = deals_geographies.geography_id WHERE geographies.id IN (id1, id2, id3) ORDER BY geographies.level DESC

it sounds good at the beginning. If you see it right then you are facing the same problem I was facing before. So please continue reading

The problem here is that inner joining will perform a Cartesian product between deals and geographies and as we have several geographies for each deal, we will get several rows for each deal

When you say DISTINCT without defining the criteria to the DB engine, it will choose any row which could get you the highest geography this deal has or it can get you the lowest geography this deal has

This is an Ambiguous Selection
but it is your fault that you did it that way

Lets see now how group by will solve this

select deals.*, MAX(geographies.level) as max_level from deals INNER JOIN deals_geographies ON deals.id = deals_geographies.deal_id INNER JOIN geographies ON geographies.id = deals_geographies.geography_id WHERE geographies.id IN (id1, id2, id3) ORDER BY max_level DESC group by deals.id

Now I told the DB engine what should the criteria it should use to remove multiple deal rows and which row it should leave which I wanted it to be the one with the largest geography level as this is my desired condition

My last conclusion is that you can use DISTINCT in one of these cases only

  1. you have a 1 to 1 relation between two or several tables and thus no ambiguity will be there
  2. you only care about 1 table fields and you are just using the other table for no more but Filtration which was not the case above as I needed it for ordering

btw, there is a rumor I heard saying that DISTINCT is not standard SQL. Not sure of this info but you can check that yourself and maybe comment and tell me

Hope this blog was useful and enlightening :)

No comments: