Re: Comment fix and question about dshash.c

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Antonin Houska <ah(at)cybertec(dot)at>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Comment fix and question about dshash.c
Date: 2018-10-26 21:03:21
Message-ID: CAEepm=0RMoHPRovc5ugcPmKmaNRFUeZ+3bCf8uXbJ6Fx-1jCuQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <ah(at)cybertec(dot)at> wrote:
> 1. The return type of the resize() function is void, so I propose that part
> of the header comment be removed:
>
> diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
> index b46f7c4cfd..b2b8fe60e1 100644
> --- a/src/backend/lib/dshash.c
> +++ b/src/backend/lib/dshash.c
> @@ -672,9 +672,7 @@ delete_item(dshash_table *hash_table, dshash_table_item *item)
>
> /*
> * Grow the hash table if necessary to the requested number of buckets. The
> - * requested size must be double some previously observed size. Returns true
> - * if the table was successfully expanded or found to be big enough already
> - * (because another backend expanded it).
> + * requested size must be double some previously observed size.
> *
> * Must be called without any partition lock held.
> */

Thanks, will fix.

> 2. Can anyone please explain this macro?
>
> /* Max entries before we need to grow. Half + quarter = 75% load factor. */
> #define MAX_COUNT_PER_PARTITION(hash_table) \
> (BUCKETS_PER_PARTITION(hash_table->size_log2) / 2 + \
> BUCKETS_PER_PARTITION(hash_table->size_log2) / 4)
>
> I'm failing to understand why the maximum number of hash table entries in a
> partition should be smaller than the number of buckets in that partition.
>
> The fact that MAX_COUNT_PER_PARTITION refers to entries and not buckets is
> obvious from this condition in dshash_find_or_insert()
>
> /* Check if we are getting too full. */
> if (partition->count > MAX_COUNT_PER_PARTITION(hash_table))

Are you saying there is a bug in this logic (which is nbuckets * 0.75
expressed in integer maths), or that 0.75 is not a good maximum load
factor?  I looked around at a couple of general-purpose hash tables
and saw that some used 0.75 and some used 1.0, as a wasted
space-vs-collision trade-off.  If I have my maths right[1], then per
100 buckets, with 0.75 you expect 75 entries to occupy ~53 buckets,
but with 1.0 you expect 100 entries to occupy ~64 buckets (a quick
standalone sketch follows below).  It'd be a fair criticism that
that's arbitrarily different from the choice made for hash joins, and
that dynahash's default is 1 (though there it's a run-time parameter).

[1] https://math.stackexchange.com/questions/470662/average-number-of-bins-occupied-when-throwing-n-balls-into-n-bins
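
For what it's worth, here is a throwaway sketch (standalone C, not
dshash.c code; the function names are mine) that reproduces both the
integer maths behind MAX_COUNT_PER_PARTITION and the expected-occupancy
figures from [1].  Compile with -lm:

#include <math.h>
#include <stdio.h>

/* nbuckets / 2 + nbuckets / 4: ~75% of nbuckets in integer maths */
static int
max_count(int nbuckets)
{
	return nbuckets / 2 + nbuckets / 4;
}

/*
 * Expected number of occupied buckets after inserting n entries
 * uniformly at random into m buckets: m * (1 - (1 - 1/m)^n).
 */
static double
expected_occupied(int m, int n)
{
	return m * (1.0 - pow(1.0 - 1.0 / m, n));
}

int
main(void)
{
	printf("grow threshold for 100 buckets: %d\n", max_count(100));	/* 75 */
	printf("0.75 load: ~%.1f occupied\n", expected_occupied(100, 75));	/* ~52.9 */
	printf("1.00 load: ~%.1f occupied\n", expected_occupied(100, 100));	/* ~63.4 */
	return 0;
}

So 75 entries land in roughly 53 of 100 buckets, versus roughly 63-64
when the table is allowed to fill completely: that's the
space-vs-collision trade-off in concrete numbers.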

--
Thomas Munro
http://www.enterprisedb.com
