Skip site navigation (1) Skip section navigation (2)

Re: Internal operations when the planner makes a hash join.

From: "Igor Neyman" <ineyman(at)perceptron(dot)com>
To: "negora" <negora(at)negora(dot)com>,"Scott Carey" <scott(at)richrelevance(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>,<pgsql-performance(at)postgresql(dot)org>
Subject: Re: Internal operations when the planner makes a hash join.
Date: 2010-02-24 14:46:31
Message-ID: F4C27E77F7A33E4CA98C19A9DC6722A20598C110@EXCHANGE.corp.perceptron.com (view raw or flat)
Thread:
Lists: pgsql-performance
 

> -----Original Message-----
> From: negora [mailto:negora(at)negora(dot)com] 
> Sent: Tuesday, February 23, 2010 4:33 PM
> To: Scott Carey
> Cc: Alvaro Herrera; pgsql-performance(at)postgresql(dot)org
> Subject: Re: Internal operations when the planner makes a hash join.
> 
> Thank you for explaining me the internal behaviour of the 
> PostgreSQL engine. I'll try to look for more information 
> about that hash tables. It sounds really really interesting. 
> Your information was very useful.
> 
> The origin of my doubt resides in the fact that I need to do 
> a joint between 3 HUGE tables (millions of registries) and do 
> certain operations with the retrieved information. I was 
> deciding whether to use one SELECT with 3 JOINs, as I've been 
> doing since the beginning, or build a PL/PgSQL function based 
> on 3 nested "FOR ... IN SELECT ... LOOP" 
> structures which tried to minimize the subsequent table 
> searches storing intermediate useful data in arrays 
> (curiously, these would act as the hash tables which you 
> mention, but in a very very rudimentary way). In a case like 
> this one (possibly unable to fit in RAM), Is also JOIN the 
> best solution?
> 
> Since I've to retrieve such a big amount of columns and 
> crossed registries I had started to think that using 1 SELECT 
> with 3 JOINs would increase the number of table searches a 
> LOT and also "duplicate" the information too much. I mean 
> "duplicate" as in this case, where the Factor 1 appears 
> millions of times for every Element:
> 
> Element 1 | Sub-factor 1 | Factor 1
> Element 2 | Subf-actor 1 | Factor 1
> ...
> Element 12639747465586 | Sub-factor 1 | Factor 1 Element 1 | 
> Sub-factor 2 | Factor 1
> 
> I hope not to robber you much time but... What do you think 
> about it? Is it better either 1 SELECT with 3 JOINs or build 
> nested "FOR ... IN SELECT ... LOOP" structures? Could it be 
> one of that cases in which I've to choose between either 
> higher speed but higher memory consume (3
> JOINs) or lower speed but less memory expense (3 FORs)?
> 
> Thanks again and apologizes for extending this topic too much.
> 
> 
> Scott Carey wrote:
> > On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:
> >
> >   
> >> negora wrote:
> >>
> >>     
> >>> According to how I understood the process, the engine 
> would get the 
> >>> name from the student with ID 1 and would look for the 
> name of the 
> >>> father with ID 1 in the hashed table. It'd do exactly the 
> same with 
> >>> the student #2 and father #2. But my big doubt is about 
> the 3rd one 
> >>> (Anthony). Would the engine "know" that it already had 
> retrieved the 
> >>> father's name for the student 1 and would avoid searching for it 
> >>> into the hashed table (using some kind of internal 
> mechanism which 
> >>> allows to "re-utilize" the name)? Or would it search into 
> the hashed 
> >>> table again?<br>
> >>>       
> >> The hash table is searched again.  But that's fast, because it's a 
> >> hash table.
> >>
> >>     
> >
> > To answer the question another way, "remembering" that it 
> has already seen father A once and tracking that would use a 
> hash table to remember that fact.  
> >
> > The hash table created by the first scan IS the "remember 
> you have seen this father" data structure, optimized for fast 
> lookup.  So before even looking at the first student, the 
> hash table is built so that it is fast to find out if a 
> father has been seen before, and if so where that father's 
> data is located.  Looking this data up is often referred to 
> as a "probe" and not a "scan" because it takes just as long 
> to do if the hash table has 100 entries or 10000 entries.  
> The drawback is that the whole thing has to fit in RAM.
> >
> >
> >   
> >> -- 
> >> Alvaro Herrera                                
> http://www.CommandPrompt.com/
> >> The PostgreSQL Company - Command Prompt, Inc.
> >>
> >> --
> >> Sent via pgsql-performance mailing list 
> >> (pgsql-performance(at)postgresql(dot)org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-performance
> >>     
> >
> >
> >   
> 

So, you are trying to do "nested loop" in PL/PgSQL.
Why not let optimizer decide between "nested loop" and "hash join" based
on your memory settings and statistics collected for objects involved?
I'm pretty sure, it'll be faster than PL/PgSQL 3 nested loops.

Igor Neyman

In response to

pgsql-performance by date

Next:From: A. KretschmerDate: 2010-02-24 14:55:36
Subject: Re: partitioned tables query not using indexes
Previous:From: Kevin KempterDate: 2010-02-24 14:36:36
Subject: partitioned tables query not using indexes

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group