Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2018-01-26 23:45:37
Message-ID: 20180126234537.dyu6jwnqz7snpq7n@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2017-12-10 23:09:42 +0100, Tomas Vondra wrote:
> 1) randomized hashing, i.e. using the extended hash functions and a
> random seed

Given the lack of extended hash functions for everything this doesn't
seem quite viable. But I think we can make it work quite easily
nonetheless - we can just do the combining with a random seed outside of
the hashtable. We kinda already do so (with a really crappy combiner) in
TupleHashTableHash(), except that we'd actually have to use a random
seed.

This seems like something we should do, just out of defensiveness. It'll
not be as good as persisting the hashfunction's state across calls, but
well, ...

Let me write a patch for that.

> 2) universal hashing [1], i.e. passing the hash through ((a*h+b) mod p)
> formula, where h is "our" hash, and (a,b) are random and p is a prime.
> This can't possibly fix the collisions reported by Todd, because the
> result will remain the same. But it should fix "my" chains of values by
> spreading them a bit (and the a,b,p values are generated on each grow,
> so in the unlikely case where we get a chain, it should disappear after
> a growth)

This seems on the too expensive side of things for my taste to do
unconditionally.

> 3) disabling growth of the hash table when the fillfactor would drop too
> much (e.g. below 0.2)

Yea, I think we should just apply something like this.

> FWIW I do agree the data sets shared in this thread are pretty extreme
> and it doesn't make much sense to slow the regular cases. I'll be
> perfectly happy if we stop the OOM, making those cases fast is a bonus.

Yea, agreed on that. I'm kinda inclined to go for stop-growing in 10,
and so something better in 11. And then later possibly backpatch if
we've grown some confidence?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Guo Xiang Tan 2018-01-26 23:46:23 Re: BUG #15032: Segmentation fault when running a particular query
Previous Message Tomas Vondra 2018-01-26 23:45:18 Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop