Re: Internal operations when the planner makes a hash join.

From: "Igor Neyman" <ineyman(at)perceptron(dot)com>
To: "negora" <negora(at)negora(dot)com>, "Scott Carey" <scott(at)richrelevance(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Internal operations when the planner makes a hash join.
Date: 2010-02-24 14:46:31
Message-ID: F4C27E77F7A33E4CA98C19A9DC6722A20598C110@EXCHANGE.corp.perceptron.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

> -----Original Message-----
> From: negora [mailto:negora(at)negora(dot)com]
> Sent: Tuesday, February 23, 2010 4:33 PM
> To: Scott Carey
> Cc: Alvaro Herrera; pgsql-performance(at)postgresql(dot)org
> Subject: Re: Internal operations when the planner makes a hash join.
>
> Thank you for explaining me the internal behaviour of the
> PostgreSQL engine. I'll try to look for more information
> about that hash tables. It sounds really really interesting.
> Your information was very useful.
>
> The origin of my doubt resides in the fact that I need to do
> a joint between 3 HUGE tables (millions of registries) and do
> certain operations with the retrieved information. I was
> deciding whether to use one SELECT with 3 JOINs, as I've been
> doing since the beginning, or build a PL/PgSQL function based
> on 3 nested "FOR ... IN SELECT ... LOOP"
> structures which tried to minimize the subsequent table
> searches storing intermediate useful data in arrays
> (curiously, these would act as the hash tables which you
> mention, but in a very very rudimentary way). In a case like
> this one (possibly unable to fit in RAM), Is also JOIN the
> best solution?
>
> Since I've to retrieve such a big amount of columns and
> crossed registries I had started to think that using 1 SELECT
> with 3 JOINs would increase the number of table searches a
> LOT and also "duplicate" the information too much. I mean
> "duplicate" as in this case, where the Factor 1 appears
> millions of times for every Element:
>
> Element 1 | Sub-factor 1 | Factor 1
> Element 2 | Subf-actor 1 | Factor 1
> ...
> Element 12639747465586 | Sub-factor 1 | Factor 1 Element 1 |
> Sub-factor 2 | Factor 1
>
> I hope not to robber you much time but... What do you think
> about it? Is it better either 1 SELECT with 3 JOINs or build
> nested "FOR ... IN SELECT ... LOOP" structures? Could it be
> one of that cases in which I've to choose between either
> higher speed but higher memory consume (3
> JOINs) or lower speed but less memory expense (3 FORs)?
>
> Thanks again and apologizes for extending this topic too much.
>
>
> Scott Carey wrote:
> > On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:
> >
> >
> >> negora wrote:
> >>
> >>
> >>> According to how I understood the process, the engine
> would get the
> >>> name from the student with ID 1 and would look for the
> name of the
> >>> father with ID 1 in the hashed table. It'd do exactly the
> same with
> >>> the student #2 and father #2. But my big doubt is about
> the 3rd one
> >>> (Anthony). Would the engine "know" that it already had
> retrieved the
> >>> father's name for the student 1 and would avoid searching for it
> >>> into the hashed table (using some kind of internal
> mechanism which
> >>> allows to "re-utilize" the name)? Or would it search into
> the hashed
> >>> table again?<br>
> >>>
> >> The hash table is searched again. But that's fast, because it's a
> >> hash table.
> >>
> >>
> >
> > To answer the question another way, "remembering" that it
> has already seen father A once and tracking that would use a
> hash table to remember that fact.
> >
> > The hash table created by the first scan IS the "remember
> you have seen this father" data structure, optimized for fast
> lookup. So before even looking at the first student, the
> hash table is built so that it is fast to find out if a
> father has been seen before, and if so where that father's
> data is located. Looking this data up is often referred to
> as a "probe" and not a "scan" because it takes just as long
> to do if the hash table has 100 entries or 10000 entries.
> The drawback is that the whole thing has to fit in RAM.
> >
> >
> >
> >> --
> >> Alvaro Herrera
> http://www.CommandPrompt.com/
> >> The PostgreSQL Company - Command Prompt, Inc.
> >>
> >> --
> >> Sent via pgsql-performance mailing list
> >> (pgsql-performance(at)postgresql(dot)org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-performance
> >>
> >
> >
> >
>

So, you are trying to do "nested loop" in PL/PgSQL.
Why not let optimizer decide between "nested loop" and "hash join" based
on your memory settings and statistics collected for objects involved?
I'm pretty sure, it'll be faster than PL/PgSQL 3 nested loops.

Igor Neyman

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message A. Kretschmer 2010-02-24 14:55:36 Re: partitioned tables query not using indexes
Previous Message Kevin Kempter 2010-02-24 14:36:36 partitioned tables query not using indexes