Re: cost_hashjoin

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: cost_hashjoin
Date:	2010-08-30 12:34:21
Message-ID:	AANLkTi=_DJZViKQ5C4n3oUDdBfPiGe=odA6BHpHPt1UN@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Aug 30, 2010 at 10:18 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> cost_hashjoin() has some treatment of what occurs when numbatches > 1
> but that additional cost is not proportional to numbatches.

Because that's not how our hash batching works. We generate two temp
files for each batch, one for the outer and one for the inner. So if
we're batching then every tuple of both the inner and outer tables
(except for ones in the first batch) needs to be written once and read
once regardless of the number of batches.
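
To make the point concrete, here is a rough sketch of the I/O accounting, not the actual executor or cost_hashjoin() code. The function name spill_io and the use of key % nbatch in place of a real hash function are assumptions made purely for illustration. It counts how many tuples get written to and read back from temp files when batch 0 is handled in memory and every other batch is spilled.

```python
def spill_io(num_inner, num_outer, nbatch):
    """Count temp-file tuple writes and reads for a simplified batched hash join.

    Hypothetical model: batch 0 is processed in memory as tuples arrive;
    a tuple belonging to any other batch is written to that batch's temp
    file once and read back once when the batch is processed.
    """
    writes = 0
    reads = 0
    for ntuples in (num_inner, num_outer):
        for key in range(ntuples):
            batch = key % nbatch  # stand-in for hash(key) % nbatch
            if batch != 0:
                writes += 1  # spilled once
                reads += 1   # read back once
    return writes, reads


for nbatch in (1, 2, 10, 100):
    print(nbatch, spill_io(1000, 1000, nbatch))
```

With 1000 tuples on each side, going from 2 to 100 batches moves the count from 1000 to 1980 written tuples. It approaches the total of 2000 tuples but never exceeds it, so the extra I/O is bounded by one write and one read per tuple rather than growing in proportion to the number of batches. This model ignores the case where the number of batches is increased partway through the join.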

I do think the hash join implementation is a good demonstration of why
C programming is faster at a micro-optimization level but slower at a
macro level. Users of higher level languages would be much more likely
to use any of the many fancier hashing data structures developed in
the last few decades. In particular, I think Cuckoo hashing would be
interesting for us.
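
For anyone who hasn't run into it, a minimal sketch of a cuckoo hash table follows. It is only an illustration of the general technique, not a proposal for how it would look in the executor. The class name, the max_kicks limit, and deriving the two hash functions by salting Python's built-in hash() are all assumptions of this sketch. The property of interest is that a lookup probes at most two slots, one per table.

```python
class CuckooHashTable:
    """Toy cuckoo hash table with two tables and two hash functions."""

    def __init__(self, capacity=16, max_kicks=32):
        self.capacity = capacity      # slots per table
        self.max_kicks = max_kicks    # displacement limit before growing
        self.tables = [[None] * capacity, [None] * capacity]
        self.size = 0

    def _slot(self, which, key):
        # Two hash functions obtained by salting the built-in hash.
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A key can only live in one of two slots, so probe both.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite in place if the key is already present.
        for which in (0, 1):
            idx = self._slot(which, key)
            entry = self.tables[which][idx]
            if entry is not None and entry[0] == key:
                self.tables[which][idx] = (key, value)
                return
        # Otherwise insert, evicting occupants and moving each evicted
        # entry to its slot in the other table.
        entry = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            idx = self._slot(which, entry[0])
            entry, self.tables[which][idx] = self.tables[which][idx], entry
            if entry is None:
                self.size += 1
                return
            which = 1 - which
        # Too many displacements: grow, rehash, then place the leftover.
        self._grow()
        self.put(entry[0], entry[1])

    def _grow(self):
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self.size = 0
        for key, value in old:
            self.put(key, value)


table = CuckooHashTable(capacity=4)
for k in range(100):
    table.put(k, k * k)
print(table.get(7), table.size)
```

Inserts can trigger a chain of displacements or a full rehash, which is the price paid for the two-probe worst-case lookup.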

--
greg

In response to

cost_hashjoin at 2010-08-30 09:18:54 from Simon Riggs

Responses

Re: cost_hashjoin at 2010-08-30 13:49:11 from Simon Riggs
