Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-09 21:53:19
Message-ID: CA+hUKGJFNg0XqXiBNzOeTPsCa4tb-a00Sv5EA7jD+w26ATyUyA@mail.gmail.com
Lists: pgsql-bugs

On Sun, Nov 10, 2019 at 9:05 AM James Coleman <jtc331(at)gmail(dot)com> wrote:
> In this case though the failure in some sense seems fairly artificial.
> Aside from the query being slow, there doesn't appear to be any real
> limitation on the query completing. The box in question has 768GB of
> memory, so limiting this memory structure to 1GB seems artificial.

Yeah, completely artificial. We limit ourselves to MaxAllocSize, a
self-imposed limit that applies to palloc() and dsa_allocate(),
probably dating back to a time in history when anyone asking for more
than that must surely have been out of their mind. Then there are a few
places where we allow ourselves to go past that limit by passing in an
extra flag MCXT_ALLOC_HUGE or DSA_ALLOC_HUGE: from a quick grep,
that's the bitmap heap scan page table, and anything using
simplehash.h's default allocator (so I guess that includes hash agg;
that's interesting, we can do a 64GB-bucket-array hash agg but not
hash join).
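
To make that concrete, here's a schematic sketch of the two tiers
(paraphrasing the API from memory, not quoting the tree):

    #include "postgres.h"
    #include "utils/dsa.h"
    #include "utils/memutils.h"

    static void
    allocation_tiers(dsa_area *area, Size nbytes)
    {
        /*
         * Ordinary paths: these raise "invalid memory alloc request
         * size" / "invalid DSA memory alloc request size" as soon as
         * nbytes exceeds MaxAllocSize (0x3fffffff, just under 1GB).
         */
        void       *p = palloc(nbytes);
        dsa_pointer dp = dsa_allocate(area, nbytes);

        /*
         * "Huge" variants: the extra flag raises the ceiling to
         * MaxAllocHugeSize instead.
         */
        void       *hp = palloc_extended(nbytes, MCXT_ALLOC_HUGE);
        dsa_pointer hdp = dsa_allocate_extended(area, nbytes,
                                                DSA_ALLOC_HUGE);

        pfree(p);
        pfree(hp);
        dsa_free(area, dp);
        dsa_free(area, hdp);
    }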

As noted on https://wiki.postgresql.org/wiki/Hash_Join, there are a
couple of artificial constraints on hash joins: the limit on the
number of hash join buckets which comes entirely from using 'int' as a
data type for bucket numbers (an anachronism from pre-standard C or
maybe just 32-bit-ism: the correct type is surely size_t, which is by
definition big enough for any array that you could possibly address),
and the MaxAllocSize thing. Easy to fix, of course. I noticed that
when working on PHJ and decided to keep the same restriction for the
new parallel code paths, because (1) it seemed like a policy choice we
should make and then apply to both, and (2) it does provide some kind
of sanity check, though it's increasingly looking overdue to be
removed (in other words: previously I was only speculating about
people showing up with ~1TB RAM machines and running into these
ancient limits, but ... here we are).
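
To illustrate where that 'int' ceiling bites, here's roughly the
shape of the clamping in ExecChooseHashTableSize(), paraphrased from
memory as a hypothetical helper (the power-of-two rounding and some
other details are omitted):

    #include "postgres.h"
    #include <limits.h>
    #include "executor/hashjoin.h"
    #include "utils/memutils.h"

    /*
     * Not code from the tree, just the arithmetic: with nbuckets
     * declared as an int and the bucket array required to fit in
     * MaxAllocSize, 8-byte HashJoinTuple pointers cap us at ~2^27
     * (~134 million) buckets, no matter how much RAM the box has.
     */
    static int
    clamp_nbuckets(size_t hash_table_bytes)
    {
        size_t      max_pointers = hash_table_bytes / sizeof(HashJoinTuple);

        max_pointers = Min(max_pointers,
                           MaxAllocSize / sizeof(HashJoinTuple));
        max_pointers = Min(max_pointers, INT_MAX / 2);
        return (int) max_pointers;
    }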

For example, suppose we added the DSA_ALLOC_HUGE flag to the line that
is failing in your case. Now it would be able to blow through that
1GB limit, but where would it stop? Until we know how you're reaching
this state, it's hard to know whether it'd go to (say) 2GB, and then
work perfectly, job done, or whether it'd keep going until it ate all
your RAM and made your users really unhappy.
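
If I'm remembering the call site correctly, the allocation that's
failing is the shared batch array set up in
ExecParallelHashJoinSetUpBatches(), so the hypothetical change would
be something along these lines (sketch only, and as I say, it's not
obviously the right fix):

    /* As it stands (simplified): implicitly capped at MaxAllocSize. */
    pstate->batches =
        dsa_allocate0(hashtable->area,
                      EstimateParallelHashJoinBatch(hashtable) * nbatch);

    /* Hypothetically: allowed past 1GB, with no obvious place to stop. */
    pstate->batches =
        dsa_allocate_extended(hashtable->area,
                              EstimateParallelHashJoinBatch(hashtable) * nbatch,
                              DSA_ALLOC_ZERO | DSA_ALLOC_HUGE);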

I think this must be a case of extreme skew, as complained about
in [1]. Let's see... you have ~6 billion rows, and you said the
planner knew that (it estimated about a billion, when there were 6
workers, so it's in the ball park). You didn't say how many batches
the planner planned for. Let's see if I can guess... 256 or 512?
That'd allow for 6 billion * 16 byte rows + some slop, chopped up into
a power-of-two number of partitions that fit inside 500MB. Then at
execution time, they didn't fit, and we went into
repartition-until-it-fits mode. At some point we tried to cut them
into ~2 million partitions and hit this error. That'd be a paltry
3,000 rows per partition if the keys were uniformly distributed, but
it thinks that at least one partition is not fitting into 500MB.
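
Just to show my work, here's the back-of-envelope behind that guess
as a standalone scrap of C (the ~16 byte rows and ~500MB budget are
my assumptions from above, not numbers you've given me):

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint64_t    ntuples = 6000000000ULL;        /* ~6 billion rows */
        uint64_t    tuple_bytes = 16;               /* ~16 bytes + slop */
        uint64_t    budget = 500ULL * 1024 * 1024;  /* ~500MB per batch */
        uint64_t    total = ntuples * tuple_bytes;  /* ~96GB in total */
        uint64_t    nbatch = 1;

        /* Double the batch count until each batch's share fits. */
        while (total / nbatch > budget)
            nbatch *= 2;

        printf("planned batches: %llu\n",
               (unsigned long long) nbatch);        /* prints 256 */
        return 0;
    }
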
Conclusion: unless this is a carefully crafted hash attack, there must
be one particular key that has more than 500MB worth of rows, plus a
smattering of other keys that fall into the same partition and keep
us trying to repartition, until a split eventually squeezes all of
them in the same direction (which would require repeatedly splitting
partitions until you reach one partition per tuple!).

So I'm wondering if this would be fixed by, say, a 95%
threshold (instead of 100%) for the extreme skew detector, as I
proposed in a patch in the first email in that thread that later
veered off into the BNL discussion[1]. Unfortunately that patch only
deals with non-parallel HJ, but a PHJ version should be easy. And if
not by 95%, I wonder what threshold would be needed for this case, and
what kind of maths to use to think about it. If I wrote a patch like
[1] with PHJ support, would you be able to test it on a copy of your
workload?
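
For reference, the shape of the change I have in mind, paraphrasing
the non-parallel detector in ExecHashIncreaseNumBatches() from memory
(the PHJ code would need the equivalent treatment, and 95% is just a
straw-man number):

    /*
     * Today we only disable further batch growth if a split moved all
     * of the in-memory tuples or none of them.
     */
    if (nfreed == 0 || nfreed == ninmemory)
        hashtable->growEnabled = false;

    /* Hypothetical: also give up when the split is hopelessly lopsided. */
    if (nfreed <= ninmemory * 0.05 || nfreed >= ninmemory * 0.95)
        hashtable->growEnabled = false;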

[1] https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
