Re: Hash Join cost estimates

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hash Join cost estimates
Date: 2013-03-30 20:52:42
Message-ID: 20130330205242.GZ4361@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> I think the point is that there may *not* be few hash collisions ...

Right, but that's actually all entirely based on concerns over there
being duplicates (hence why we consider the MCVs and ndistinct), which
makes *some* sense given that we currently have a single linked-list in
each bucket into which any dups are placed.

It occurs to me that we could get away from this by having a 2-level
system. We hash to a bucket which contains a linked list of linked
lists. The bucket list would be for actual collisions (which we don't
consider at all in the current bucket estimating) and each of those
entries would be a linked list of duplicates for that entry in the
bucket. This would add some small cost for building the hash, since we
would have to step through each entry in the bucket to determine if the
entry being added is a new entry or not, but not very much unless we're
worried about lots of collisions happening (which I certainly hope isn't
the case). Hash tables are generally expected to take more effort to
build initially anyway, so I don't see a problem putting more logic
there. Also, we could skip checking each entry in the bucket when the
input is known to be unique and instead just skip to the end of the
bucket since the new entry can't match any existing.

We could still work through the bucketing logic and add some costing to
that case for those situations where we are hashing on only one key of a
multi-key join and we expect a lot of duplicates to exist. I'm not sure
how much that happens though- I would hope that we would use a composite
hash key most of the time that we have multi-key joins that use hash
tables.

Thoughts?

Thanks,

Stephen

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2013-03-30 21:07:02 Re: Hash Join cost estimates
Previous Message Stephen Frost 2013-03-30 20:15:25 Re: Hash Join cost estimates