Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2017-12-06 23:45:34
Message-ID: 0e91ba4e-5afc-220e-b1f7-56a838aee845@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 12/06/2017 11:55 PM, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2017-12-06 16:38:18 -0500, Todd A. Cook wrote:
>>> I found this problem when I dropped 10.1 into a test environment to see
>>> what would happen. There was no deliberate attempt to break anything.
>
>> Read Thomas' message at: http://archives.postgresql.org/message-id/263b03b1-3e1c-49ca-165a-8ac6751419c4%402ndquadrant.com
>
> I'm confused by Tomas' claim that
>
>>> (essentially hashint8 only ever produces 60% of
>>> values from [0,1000000], which likely increases collision rate).
>
> This is directly contradicted by the simple experiments I've done, eg
>
> regression=# select count(distinct hashint8(v)) from generate_series(0,1000000::int8) v;
> count
> --------
> 999879
> (1 row)
>
> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,1000000::int8) v;
> count
> --------
> 644157
> (1 row)
>
> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,10000000::int8) v;
> count
> ---------
> 1048514
> (1 row)
>
> It's certainly not perfect, but I'm not observing any major failure to
> span the output space.
>

That is not the claim I was making, though. When generating data sets
I've been considering only values where hashint8(v) is between 0 and
1000000 *without* the masking. Otherwise growing the hash table would
break the "chain" of values in the hash table, which would resolve the
issue with SH_GROW_MAX_MOVE.

So you'd have to do something like this:

select count(distinct hashint8(v))
from generate_series(0,1000000::int8) v
where hashint8(v) between 0 and 1024*1024;

but to get some numbers you'd have to also increase the value passed to
generate_series (because it'll filter most of the values).

Which is why I generated the data by an ugly C program, similar to the
attached one. It keeps a small bitmap for generated hashes in the [0,1M]
range, and prints the number of bits set after every 1e9 values processed.

What I do see is this:

i=1000000000 nset=217666
i=2000000000 nset=389526
i=3000000000 nset=525135
i=4000000000 nset=632305
i=5000000000 nset=659574
i=6000000000 nset=659574
i=7000000000 nset=659574
i=8000000000 nset=659574
i=9000000000 nset=659574
...

So somewhere between 4e9 and 5e9 we generate 659574 hashes in the range,
and then never increase it further.

I don't know if this an issue in practice, but it seems that for large
hash tables (on bigint), growing the hash table may not be that great
improvement because we're not really doubling the number of slots we can
hit directly. We'll probably find an empty slot nearby, though.

It was just an observation - I only noticed that because I tried to
construct the adversarial data set on bigint, and couldn't make it work
no matter what.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
duplicities-bigint.c text/x-csrc 1.7 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Jan Schulz 2017-12-07 09:54:20 Re: BUG #14948: cost overflow
Previous Message Tom Lane 2017-12-06 22:55:19 Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop