Re: HashJoin w/option to unique-ify inner rel

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Subject: Re: HashJoin w/option to unique-ify inner rel
Date: 2009-04-25 03:52:55
Message-ID: 603c8f070904242052k38ba2651pbc7cc289a4cf939d@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 24, 2009 at 10:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> As far as I can tell, the focus on trying to estimate the number of
>> tuples per bucket is entirely misguided.  Supposing the relation is
>> mostly unique so that the values don't cluster too much, the right
>> answer is (of course) NTUP_PER_BUCKET.
>
> But the entire point of that code is to arrive at a sane estimate
> when the inner relation *isn't* mostly unique and *does* cluster.
> So I think you're being much too hasty to conclude that it's wrong.
>
>> Because the extra tuples that get thrown into the bucket
>> generally don't have the same hash value (or if they did, they would
>> have been in the bucket either way...) and get rejected with a simple
>> integer comparison, which is much cheaper than
>> hash_qual_cost.per_tuple.
>
> Yeah, we are charging more than we ought to for bucket entries that can
> be rejected on the basis of hashcode comparisons.  The difficulty is to
> arrive at a reasonable guess of what fraction of the bucket entries will
> be so rejected, versus those that will incur a comparison-function call.
> I'm leery of assuming there are no hash collisions, which is what you
> seem to be proposing.

Not really (though for small relations the chances of a collision are very low).

What bothers me is that estimate_hash_bucketsize() treats the
ndistinct < nbuckets and ndistinct > nbuckets cases as symmetrical,
and they really aren't. When ndistinct < nbuckets, the values are all
going to cluster in a subset of the buckets, and we're not going to be
able to reject much of anything on the basis of hashcode comparisons,
because odds are good that there's only one distinct value in the
bucket, so every tuple we see will require a hash qual evaluation (and
they'll probably all return true). On the other hand, when ndistinct
> nbuckets, we expect multiple distinct values per bucket, and the
high bits of the hash code will frequently be different, and so the
hashcode comparison will be pretty successful in tossing stuff out the
window. So the problem is that the information we have here:

    if (ndistinct > nbuckets)
        estfract = 1.0 / nbuckets;
    else
        estfract = 1.0 / ndistinct;

...isn't available to cost_hashjoin(). If the lower half of this
branch is selected, chances are good that the hash quals will have to
be evaluated almost 100% of the time (so the 0.5 that cost_hashjoin
uses is an underestimate). But if the upper half is selected, then
our odds of not needing to evaluate the hash quals are something like
1 - (nbuckets / ndistinct).
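To make the asymmetry concrete, here's a minimal sketch of the
estimate I have in mind (the function name and signature are mine,
not anything in the tree; it assumes a well-behaved hash function and
roughly even distribution of the distinct values):

```c
/*
 * Hypothetical helper, not PostgreSQL code: estimate the probability
 * that a probe against a matched bucket entry must actually evaluate
 * the hash quals, rather than being rejected by the cheap hashcode
 * comparison.
 */
double
hash_qual_eval_probability(double ndistinct, double nbuckets)
{
    if (ndistinct <= nbuckets)
    {
        /*
         * Roughly one distinct value per occupied bucket: nearly every
         * entry in the matched bucket shares the probe's hash value, so
         * the hashcode comparison rejects almost nothing.
         */
        return 1.0;
    }

    /*
     * Multiple distinct values per bucket: on average ndistinct/nbuckets
     * distinct values land in each bucket, so an entry matches the
     * probe's full hash code only about nbuckets/ndistinct of the time;
     * the rest are tossed out by the integer comparison.
     */
    return nbuckets / ndistinct;
}
```

With, say, 4000 distinct values and 1000 buckets, that gives 0.25
rather than the flat 0.5 cost_hashjoin assumes today.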

The thing to do, or so it seems to me, is to rewrite
estimate_hash_bucketsize() to return the number of hash qual
evaluations per probe rather than the fraction of items in a given
bucket; that's where you have the statistics to make that estimate.
It should be possible to factor in the possibility of hash collisions
as well. If the hash function works totally excellently, the chance
of a hash collision for a single value will be 2^-x, where x is the
number of bits that are used neither to select the bucket nor to
select the batch. You can multiply through by the estimated number of
entries in the bucket and then by some fudge factor. This won't
matter much for small relations but it might for large ones,
especially for large settings of work_mem. (If work_mem is set to a
relatively normal value, the cost of going multi-batch is likely to
blow the hash-join plan out of the water anyway... but Stephen Frost
was telling me at JDcon East that he sometimes sets it to something
like 8GB when he's the only user on his apparently-quite-awesome
hardware...)

...Robert
