I am writing this blog entry to tell people about main differences between these two instructions because any misunderstanding to the true difference between them can cause big problems
That what happened to me actually    
I didn’t not know the true difference and as always I have chosen the easy way and rely on Distinct instead of Group By
Let me show an example which I was working on    
and that one really helped me understand the difference between these two
Suppose you have a Deal which is associated with several geographies    
Each geography has a level representing its level in the tree of geographies. The higher the level, the lower the geography exists in the tree.
The objective was to get all deals in several geographies which I was given their IDS and sort them by depth/level from lowest to highest
I started writing my query to be like that    
which is wrong by the way
select Distinct(deals.id) from deals INNER JOIN deals_geographies ON deals.id = deals_geographies.deal_id INNER JOIN geographies ON geographies.id = deals_geographies.geography_id WHERE geographies.id IN (id1, id2, id3) ORDER BY geographies.level DESC
it sounds good at the beginning. If you see it right then you are facing the same problem I was facing before. So please continue reading
The problem here is that inner joining will perform a Cartesian product between deals and geographies and as we have several geographies for each deal, we will get several rows for each deal
When you say DISTINCT without defining the criteria to the DB engine, it will choose any row which could get you the highest geography this deal has or it can get you the lowest geography this deal has
This is an Ambiguous Selection     
but it is your fault that you did it that way
Lets see now how group by will solve this
select deals.*, MAX(geographies.level) as max_level from deals INNER JOIN deals_geographies ON deals.id = deals_geographies.deal_id INNER JOIN geographies ON geographies.id = deals_geographies.geography_id WHERE geographies.id IN (id1, id2, id3) ORDER BY max_level DESC group by deals.id
Now I told the DB engine what should the criteria it should use to remove multiple deal rows and which row it should leave which I wanted it to be the one with the largest geography level as this is my desired condition
My last conclusion is that you can use DISTINCT in one of these cases only
- you have a 1 to 1 relation between two or several tables and thus no ambiguity will be there
- you only care about 1 table fields and you are just using the other table for no more but Filtration which was not the case above as I needed it for ordering
btw, there is a rumor I heard saying that DISTINCT is not standard SQL. Not sure of this info but you can check that yourself and maybe comment and tell me
Hope this blog was useful and enlightening :)
 
No comments:
Post a Comment