Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14932: SELECT DISTINCT val FROM table gets stuck in an infinite loop
Date: 2017-12-10 23:14:54
Message-ID: 29a04fd9-abe6-7c5f-dadc-a80459890f53@2ndquadrant.com
Lists: pgsql-bugs

On 12/07/2017 12:45 AM, Tomas Vondra wrote:
>
>
> On 12/06/2017 11:55 PM, Tom Lane wrote:
>> Andres Freund <andres(at)anarazel(dot)de> writes:
>>> On 2017-12-06 16:38:18 -0500, Todd A. Cook wrote:
>>>> I found this problem when I dropped 10.1 into a test environment to see
>>>> what would happen. There was no deliberate attempt to break anything.
>>
>>> Read Thomas' message at: http://archives.postgresql.org/message-id/263b03b1-3e1c-49ca-165a-8ac6751419c4%402ndquadrant.com
>>
>> I'm confused by Tomas' claim that
>>
>>>> (essentially hashint8 only ever produces 60% of
>>>> values from [0,1000000], which likely increases collision rate).
>>
>> This is directly contradicted by the simple experiments I've done, eg
>>
>> regression=# select count(distinct hashint8(v)) from generate_series(0,1000000::int8) v;
>> count
>> --------
>> 999879
>> (1 row)
>>
>> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,1000000::int8) v;
>> count
>> --------
>> 644157
>> (1 row)
>>
>> regression=# select count(distinct hashint8(v) & (1024*1024-1)) from generate_series(0,10000000::int8) v;
>> count
>> ---------
>> 1048514
>> (1 row)
>>
>> It's certainly not perfect, but I'm not observing any major failure to
>> span the output space.
>>
>
> That is not the claim I was making, though. When generating data sets
> I've been considering only values where hashint8(v) is between 0 and
> 1000000 *without* the masking. Otherwise growing the hash table would
> break the "chain" of values in the hash table, which would resolve the
> issue with SH_GROW_MAX_MOVE.
>
> So you'd have to do something like this:
>
> select count(distinct hashint8(v))
> from generate_series(0,1000000::int8) v
> where hashint8(v) between 0 and 1024*1024;
>
> but to get some numbers you'd have to also increase the value passed to
> generate_series (because it'll filter most of the values).
>
> Which is why I generated the data by an ugly C program, similar to the
> attached one. It keeps a small bitmap for generated hashes in the [0,1M]
> range, and prints the number of bits set after every 1e9 values processed.
>
> What I do see is this:
>
> i=1000000000 nset=217666
> i=2000000000 nset=389526
> i=3000000000 nset=525135
> i=4000000000 nset=632305
> i=5000000000 nset=659574
> i=6000000000 nset=659574
> i=7000000000 nset=659574
> i=8000000000 nset=659574
> i=9000000000 nset=659574
> ...
>
> So somewhere between 4e9 and 5e9 we generate 659574 hashes in the range,
> and then never increase it further.
>
> I don't know if this an issue in practice, but it seems that for large
> hash tables (on bigint), growing the hash table may not be that great
> improvement because we're not really doubling the number of slots we can
> hit directly. We'll probably find an empty slot nearby, though.
>
> It was just an observation - I only noticed that because I tried to
> construct the adversarial data set on bigint, and couldn't make it work
> no matter what.
>

I've done a simple experiment today: computing the hash for every uint32
value and counting how many distinct hash values that produces. For all
4.2 billion values the results look like this:

seconds nkeys ndistinct ndistinct/nkeys
3.380 i=100000000 nset=98829159 0.99
6.240 i=200000000 nset=195351181 0.98
9.106 i=300000000 nset=289623103 0.97
11.932 i=400000000 nset=381695003 0.95
14.814 i=500000000 nset=471621355 0.94
17.706 i=600000000 nset=559452278 0.93
20.577 i=700000000 nset=645242762 0.92
23.496 i=800000000 nset=729036217 0.91
26.460 i=900000000 nset=810879821 0.90
29.380 i=1000000000 nset=890818778 0.89
32.331 i=1100000000 nset=968898478 0.88
35.282 i=1200000000 nset=1045164189 0.87
38.262 i=1300000000 nset=1119654810 0.86
41.251 i=1400000000 nset=1192424766 0.85
44.258 i=1500000000 nset=1263482593 0.84
47.268 i=1600000000 nset=1332897449 0.83
50.305 i=1700000000 nset=1400717923 0.82
53.356 i=1800000000 nset=1466962823 0.81
56.425 i=1900000000 nset=1531660191 0.81
59.489 i=2000000000 nset=1594856205 0.80
62.593 i=2100000000 nset=1656588855 0.79
65.706 i=2200000000 nset=1716895530 0.78
68.829 i=2300000000 nset=1775809288 0.77
71.966 i=2400000000 nset=1833353377 0.76
75.123 i=2500000000 nset=1889558599 0.76
78.271 i=2600000000 nset=1944463039 0.75
81.418 i=2700000000 nset=1998096496 0.74
84.574 i=2800000000 nset=2050490745 0.73
87.711 i=2900000000 nset=2101666393 0.72
90.852 i=3000000000 nset=2151669155 0.72
93.986 i=3100000000 nset=2200517580 0.71
97.084 i=3200000000 nset=2248232772 0.70
100.172 i=3300000000 nset=2294849331 0.70
103.232 i=3400000000 nset=2340389131 0.69
106.273 i=3500000000 nset=2384875319 0.68
109.272 i=3600000000 nset=2428339639 0.67
112.260 i=3700000000 nset=2470798655 0.67
115.199 i=3800000000 nset=2512271730 0.66
118.125 i=3900000000 nset=2552788321 0.65
121.029 i=4000000000 nset=2592379529 0.65
123.895 i=4100000000 nset=2631056297 0.64
126.726 i=4200000000 nset=2668843329 0.64
129.397 loops 4294967295 set 2703915179 (0.63)

That means we only ever generate about 63% of the possible hash values.
That doesn't seem particularly healthy, I guess, but for small hash
tables (with fewer than 2^32 buckets) it may not be an issue, as some
of the values will wrap because of the modulo.

For comparison, murmurhash3 and xxhash (both fairly popular hash
functions) end up with a 1:1 ratio of values to hashes, i.e. no
collisions at all. Of course, both are somewhat slower than the simple
hash functions we use for int/bigint values.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
