Hash cost

From: Dennis Haney <davh(at)diku(dot)dk>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Hash cost
Date: 2004-03-30 13:06:21
Message-ID: 406970CD.2060808@diku.dk
Lists: pgsql-hackers

Hi

Could someone please explain why the cost estimator for the hash join
is implemented as it is? (cost_hashjoin in costsize.c)
In particular, these issues:

First, there is the estimation of the number of rows and their size.
ExecChooseHashTableSize() apparently trusts neither and doubles both,
so the function assumes the input relation is 4 times larger than the
rest of the optimizer thinks. Why is that? And why is the doubling also
applied to the size of HashJoinTupleData, yet not applied twice to the
estimated number of bytes the hash table would use, the number that
goes into the calculation of the number of batches?
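
Roughly, the arithmetic I am asking about looks like the sketch below.
The constant names and values are my own shorthand for what I read in
nodeHash.c, not a quote of the code:

#include <stdio.h>

#define FUDGE_FAC       2.0   /* the doubling in question */
#define TUPLE_OVERHEAD  40    /* my stand-in for the per-tuple HashJoinTupleData overhead */

int
main(void)
{
    double ntuples  = 100000.0;  /* the planner's row estimate for the inner rel */
    int    tupwidth = 32;        /* the planner's width estimate */

    /* Rows and per-tuple size each get the factor, so the table is sized
     * for a relation about 4 times what the planner otherwise assumes.   */
    double assumed_rows    = ntuples * FUDGE_FAC;
    double assumed_tupsize = (tupwidth + TUPLE_OVERHEAD) * FUDGE_FAC;
    double table_bytes     = assumed_rows * assumed_tupsize;

    /* The byte total that, as far as I can tell, feeds the batch-count
     * calculation gets the factor only once.                             */
    double batch_bytes = ntuples * (tupwidth + TUPLE_OVERHEAD) * FUDGE_FAC;

    printf("sized for: %.0f bytes, batched on: %.0f bytes\n",
           table_bytes, batch_bytes);
    return 0;
}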

Second, why does the optimizer first guess that there can be 10 values
in a bucket, and then afterwards spend a lot of time estimating this
number for use in another calculation, using figures that were
themselves based on the guess of 10 values per bucket?
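
In outline, the two steps I mean are something like this (again a toy
sketch with approximate names, not the actual code):

#include <math.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10    /* the hard-wired guess of 10 values per bucket */

int
main(void)
{
    double ntuples = 100000.0;

    /* Step 1: the bucket count is derived from the fixed guess. */
    double nbuckets = ceil(ntuples / NTUP_PER_BUCKET);

    /* Step 2: a per-bucket tuple fraction is then estimated from the column
     * statistics, but it is computed against the bucket count that the guess
     * in step 1 produced.  A uniform distribution is assumed here.           */
    double bucketsize_frac   = 1.0 / nbuckets;
    double tuples_per_bucket = ntuples * bucketsize_frac;

    printf("buckets: %.0f, estimated tuples per bucket: %.1f\n",
           nbuckets, tuples_per_bucket);
    return 0;
}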

Third, why does the calculation assume that the most common value
dominates by far, while the other common values count for nothing?
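For example, if the most common value accounts for 5% of the rows and a
typical value for 0.1%, the bucket-size fraction is, as I read it,
scaled up by a factor of about 50, while a second value at 4.9% changes
nothing at all.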

Fourth, why is the hash function assumed to produce no collisions
between non-identical values? And why do multiple join qualifiers not
affect this either?
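To illustrate: hashing, say, a million distinct join values into a few
thousand buckets necessarily puts many different values in the same
chain, yet the comparison count seems to be charged only for duplicates
of the probed value.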

Fifth, why is a probe assumed to look in a chain of average length
(the average number of tuples per bucket)? I would assume that a lot
more time is spent looking in the chains that hold the most tuples...
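
A toy example of what I mean, with numbers I made up: 99% of the
buckets hold 1 tuple, 1% hold 1000, and I assume probes land on buckets
in proportion to the inner tuples hashed there.

#include <stdio.h>

int
main(void)
{
    double nbuckets      = 10000.0;
    double short_buckets = 0.99 * nbuckets, short_len = 1.0;
    double long_buckets  = 0.01 * nbuckets, long_len  = 1000.0;

    double ntuples = short_buckets * short_len + long_buckets * long_len;
    double avg_len = ntuples / nbuckets;

    /* If probes follow the same distribution as the inner tuples (a frequent
     * value is also probed frequently), most probes hit a long chain.        */
    double p_long   = (long_buckets * long_len) / ntuples;
    double expected = p_long * long_len + (1.0 - p_long) * short_len;

    printf("average chain length:     %.1f\n", avg_len);   /* about 11  */
    printf("expected chain per probe: %.1f\n", expected);  /* about 910 */
    return 0;
}

The average-based estimate sees a chain of about 11 tuples, while the
chain a probe actually walks is closer to 910 in this example, because
the buckets holding the most tuples are also the ones probed most often.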

--
Dennis
