Re: Bug in huge simplehash

From: Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Subject: Re: Bug in huge simplehash
Date: 2021-08-10 11:21:36
Message-ID: CAEudQAq_Y-KhtC+=jHRT+3Mhwu17WB8eutTJDy_LBLAPJ-+U4Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 10, 2021 at 05:53, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
wrote:

> Good day, hackers.
>
> Our client caught process stuck in tuplehash_grow. There was a query
> like
> `select ts, count(*) from really_huge_partitioned_table group by ts`,
> and
> planner decided to use hash aggregation.
>
> Backtrace shows that oldsize was 2147483648 at the moment. While newsize
> was optimized out, it looks like it was SH_MAX_SIZE.
>
> #0 0x0000000000603d0c in tuplehash_grow (tb=0x7f18c3c284c8,
> newsize=<optimized out>) at ../../../src/include/lib/simplehash.h:457
> hash = 2147483654
> startelem = 1
> curelem = 1
> oldentry = 0x7f00c299e0d8
> oldsize = 2147483648
> olddata = 0x7f00c299e048
> newdata = 0x32e0448
> i = 6
> copyelem = 6
>
> EXPLAIN shows that there are 2604186278 rows in all partitions, but the
> planner thinks there will be only 200 unique rows after the GROUP BY.
> Looks like it was mistaken.
>
> Finalize GroupAggregate  (cost=154211885.42..154211936.09 rows=200 width=16)
>   Group Key: really_huge_partitioned_table.ts
>   ->  Gather Merge  (cost=154211885.42..154211932.09 rows=400 width=16)
>         Workers Planned: 2
>         ->  Sort  (cost=154210885.39..154210885.89 rows=200 width=16)
>               Sort Key: really_huge_partitioned_table.ts
>               ->  Partial HashAggregate  (cost=154210875.75..154210877.75 rows=200 width=16)
>                     Group Key: really_huge_partitioned_table.ts
>                     ->  Append  (cost=0.43..141189944.36 rows=2604186278 width=8)
>                           ->  Parallel Index Only Scan using really_huge_partitioned_table_001_idx2 on really_huge_partitioned_table_001  (cost=0.43..108117.92 rows=2236977 width=8)
>                           ->  Parallel Index Only Scan using really_huge_partitioned_table_002_idx2 on really_huge_partitioned_table_002  (cost=0.43..114928.19 rows=2377989 width=8)
>   .... and more than 400 partitions more
>
> After some investigation I found a bug that has been present in simplehash
> since its beginning:
>
> - sizemask is set only in SH_COMPUTE_PARAMETERS, and it is set this way:
>
> /* now set size */
> tb->size = size;
>
> if (tb->size == SH_MAX_SIZE)
> tb->sizemask = 0;
> else
> tb->sizemask = tb->size - 1;
>
> which means that when we resize to SH_MAX_SIZE, sizemask becomes zero.
>
> - sizemask is then used in SH_INITIAL_BUCKET and SH_NEXT to compute the
> initial and next positions:
>
> SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
> return hash & tb->sizemask;
> SH_NEXT(SH_TYPE * tb, uint32 curelem, uint32 startelem)
> curelem = (curelem + 1) & tb->sizemask;
>
> - and then SH_GROW gets stuck in the element-placing loop:
>
> startelem = SH_INITIAL_BUCKET(tb, hash);
> curelem = startelem;
> while (true)
> curelem = SH_NEXT(tb, curelem, startelem);
>
> There is an Assert(curelem != startelem) in SH_NEXT, but since no one has
> tested it with 2 billion elements, it was never triggered. And Asserts are
> not compiled into production code.
>
> The attached patch fixes it by removing the condition and adding a type
> cast:
>
> /* now set size */
> tb->size = size;
> tb->sizemask = (uint32)(size - 1);
>
>
> OOOOOOPS
>
> While writing this letter, I looked at newdata in the frame of
> tuplehash_grow:
>
> newdata = 0x32e0448
>
> It is below the 4GB border. The allocator does not place multi-gigabyte
> chunks (and we certainly need 96GB in this case) in the sub-4GB address
> space, because mmap doesn't do that.
>
> I went to check SH_GROW and.... It is `SH_GROW(SH_TYPE *tb, uint32
> newsize)`
> :-(((
> Therefore, when `tb->size == SH_MAX_SIZE/2` and we call `SH_GROW(tb,
> tb->size * 2)`, SH_GROW(tb, 0) is called due to truncation.
> And SH_COMPUTE_PARAMETERS also accepts `uint32 newsize`.
>
> Ahh... ok, patch is updated to fix this as well.
>
It seems that we need to fix the function prototype too.

/* void <prefix>_grow(<prefix>_hash *tb) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
+SH_SCOPE void SH_GROW(SH_TYPE * tb, uint64 newsize);

regards,
Ranier Vilela
