Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-09 20:05:26
Message-ID: CAAaqYe8=xUUwYp+XO6Y3YNqL0Nk+UXrqxYjT46Z+q6b-5U1N1Q@mail.gmail.com
Lists: pgsql-bugs

On Sat, Nov 9, 2019 at 1:27 PM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Sat, Nov 09, 2019 at 08:23:23AM -0500, James Coleman wrote:
> >Just realized I accidentally replied only to Tomas.
> >
> >On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
> ><tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >>
> >> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
> >> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >> >>
> >> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> >> >> > > >ERROR: invalid DSA memory alloc request size 1375731712
> >> >>
> >> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> >> >> > > >(hashtable=hashtable(at)entry=0x55a7c1db2740, nbatch=nbatch(at)entry=2097152) at
> >> >>
> >> >> > > I've briefly looked at this today, and I think the root cause is
> >> >> > > somewhat similar to what is described in [1] where we simply increase
> >> >> > > the number of batches in an effort to keep batch contents in work_mem,
> >> >> > > while ignoring that each batch requires quite a bit of memory. So we end
> >> >> > > up with a lot of batches where each is small enough to fit into
> >> >> > > work_mem, but we need much more than work_mem to track the batches.
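To put rough numbers on that, using only the figures from the error message and
the stack trace above: a quick back-of-the-envelope in plain C (not code from
the tree), assuming the failed dsa_allocate0() request is being checked against
the usual MaxAllocSize cap of 1GB - 1, which is what the error text suggests:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t request   = 1375731712;  /* size from the ERROR message */
    const uint64_t nbatch    = 2097152;     /* nbatch from the stack trace */
    const uint64_t max_alloc = 0x3fffffff;  /* MaxAllocSize: 1GB - 1 byte */

    /* Shared per-batch state implied by the failed request: 656 bytes. */
    printf("bytes per batch: %llu\n",
           (unsigned long long) (request / nbatch));

    /* How far past the allocation cap the batch array alone already is. */
    printf("exceeds MaxAllocSize by: %llu bytes\n",
           (unsigned long long) (request - max_alloc));
    return 0;
}

So the shared batch-tracking array alone works out to ~656 bytes per batch, and
at 2M batches it is already ~300MB past the 1GB allocation cap, before counting
any per-worker buffer files.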
> >> >>
> >> >> Yeah. So even when this is fixed, the query is going to perform
> >> >> *terribly*, opening and closing millions of files in random order to
> >> >> stream tuples into, if this is a case where there really are tuples to
> >> >> go to all partitions (and not just a case of extreme skew that our
> >> >> extreme skew detector fails to detect because it only detects absolute
> >> >> extreme skew).
> >> >
> >> >work_mem in the repro case is 500MB (the original failure was at
> >> >150MB). I realize that's too small for this query, though it's also
> >> >worth knowing that if I get rid of some other cluster-wide tunings
> >> >that shouldn't have been cluster-wide originally (modifications to
> >> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
> >> >an index scan and no hash (and performs much better).
> >> >
> >>
> >> So is it a case of underestimation? I.e., does the Hash side expect much
> >> less data (rows) than it gets during execution?
> >
> >I've attached a redacted plan, but the estimate with workers planned =
> >6 is 986,975,281, which seems quite reasonable to me, as the current
> >table row count is 5,917,539,491.
> >
>
> Hmmm, but the expected row width is only 16B, and with ~6B rows that's
> only about 90GB. So how come this needs 1TB temporary files? I'm sure
> there's a bit of overhead, but 10X seems a bit much.

I'm running the query again to confirm this is indeed the one that
triggers that, though I'm reasonably confident it is, since I was able
to reproduce the disk spike on a barely used replica box.

Is there any way for the estimated row width to be wrong, or to miss
some variable-length component (e.g., variable length fields)? The key
value here is of type citext, and is often 8 chars long, but sometimes
much longer (I sampled some that were 34 chars long, for example). I'm
not sure what the actual distribution of those lengths is.

> >> >I think this also correlates with us seeing a ~1TB spike in disk usage,
> >> >so your explanation of the lots of "small" files would seem to be
> >> >consistent with that.
> >>
> >> That's consistent with the data. 500MB and nbatch=2097152 is exactly
> >> 1TB, and there'll be some additional overhead.
> >
> >Good, that helps make sense of that piece of the puzzle.
> >
> >> >> > > This seems to be about the same problem, except that instead of
> >> >> > > forgetting about BufFile, the parallel hash join ignores this:
> >> >> > >
> >> >> > >     pstate->batches =
> >> >> > >         dsa_allocate0(hashtable->area,
> >> >> > >                       EstimateParallelHashJoinBatch(hashtable) * nbatch);
> >> >>
> >> >> Yeah, I failed to consider that possibility. I suppose it could be
> >> >> avoided with something like this (not tested, I will find a repro for
> >> >> this on Monday to convince myself that it's right):
> >> >>
> >> >> @@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
> >> >>                 }
> >> >>
> >> >>                 /* Don't keep growing if it's not helping or we'd overflow. */
> >> >> -               if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
> >> >> +               if (extreme_skew_detected ||
> >> >> +                   hashtable->nbatch >= INT_MAX / 2 ||
> >> >> +                   !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
> >> >> +                                     hashtable->nbatch * 2))
> >> >>                         pstate->growth = PHJ_GROWTH_DISABLED;
> >> >>                 else if (space_exhausted)
> >> >>                         pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
> >> >>
> >>
> >> Probably, but that's the execution part of it. I think we should also
> >> consider this in ExecChooseHashTableSize, and just don't do PHJ when
> >> it exceeds work_mem from the very beginning.
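To make that concrete, here's a minimal sketch of the kind of up-front check I
read that as (standalone C with made-up names, not actual planner code; in
reality the per-batch figure would come from something like
EstimateParallelHashJoinBatch() plus the BufFile overhead):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical guard: don't pick (or keep growing) a parallel hash join
 * whose per-batch bookkeeping alone would exceed the given memory budget.
 * Division avoids overflowing per_batch_bytes * nbatch.
 */
static bool
batch_overhead_fits_budget(uint64_t per_batch_bytes, uint64_t nbatch,
                           uint64_t budget_bytes)
{
    return nbatch == 0 || per_batch_bytes <= budget_bytes / nbatch;
}

int main(void)
{
    /* 656 bytes/batch (from the error above) vs. a 500MB work_mem budget:
     * anything past ~800k batches already fails this check. */
    printf("%d\n", batch_overhead_fits_budget(656, 2097152,
                                              (uint64_t) 500 * 1024 * 1024));
    return 0;
}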
> >>
> >> Once we start executing we probably can't do much better than what you
> >> proposed. In particular it doesn't make sense to limit the space by
> >> work_mem, unless we also tweak that limit because then the batch size
> >> increases arbitrarily.
> >>
> >> I think we need to do something about this in PG13 - both while planning
> >> (considering BufFile and SharedTuplestore) and during execution. The
> >> planner part seems fairly simple and independent, and I may have said
> >> before that I'll look into it.
> >>
> >> For the executor I think we've agreed the "proper" solution is BNL or
> >> something like that. Not sure how far we are from that, though; I don't
> >> recall any recent updates (but maybe I just missed them - the
> >> pgsql-hackers traffic is pretty insane). I wonder if we should get
> >> something like the "rebalancing" I proposed before, which is not a 100%
> >> fix but may at least reduce the negative impact.
> >
> >Do you happen to have a link to those discussions? I'd be interested in
> >following along. I also can't say I know what BNL stands for.
> >
>
> I think the two main threads are these two:
>
> 1) accounting for memory used for BufFile during hash joins
> https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development
>
> 2) Avoiding hash join batch explosions with extreme skew and weird stats
> https://www.postgresql.org/message-id/CA%2BhUKGKWWmf%3DWELLG%3DaUGbcugRaSQbtm0tKYiBut-B2rVKX63g%40mail.gmail.com
>
> BNL stands for "block nested loop", which means we split the inner
> relation into blocks, and then do a nested loop for (some of) them. This
> works as a fallback strategy for hash join - when a batch is too large
> to fit into work_mem, we can split it into blocks and do a BNL against
> the outer relation.

Thanks.
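
For my own understanding, here's roughly how I picture the BNL fallback you
describe - a hypothetical sketch over plain integer arrays, nothing like actual
executor code:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for "as many inner rows as fit in work_mem". */
#define BLOCK_SIZE 4

/*
 * Block nested loop: consume the (too large) inner side in fixed-size
 * blocks, and rescan the outer side once per block, so memory use stays
 * bounded by one block instead of the whole inner batch.
 */
static void
block_nested_loop_join(const int *outer, size_t nouter,
                       const int *inner, size_t ninner)
{
    for (size_t start = 0; start < ninner; start += BLOCK_SIZE)
    {
        size_t end = (start + BLOCK_SIZE < ninner) ? start + BLOCK_SIZE : ninner;

        /* One pass over the outer relation per inner block. */
        for (size_t o = 0; o < nouter; o++)
            for (size_t i = start; i < end; i++)
                if (outer[o] == inner[i])
                    printf("match: %d\n", outer[o]);
    }
}

int main(void)
{
    int outer[] = {1, 2, 3, 4, 5};
    int inner[] = {2, 2, 4, 6, 8, 2};

    block_nested_loop_join(outer, 5, inner, 6);
    return 0;
}

I.e., trading extra passes over the outer side for memory use bounded by a
single block of the inner side.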

> >> >> But James's query is still going to be terrible.
> >> >>
> >> >> Do you know if it's extreme skew (many tuples with the same key, just
> >> >> a few scattered around in other keys), or simply too much data for
> >> >> your work_mem setting?
> >> >
> >> >Given my description earlier (seq scan on a very large table), I
> >> >assume it's likely the latter? If you think that's sufficiently
> >> >likely, I'll leave it at that, or if not I could do a calculation on
> >> >that key to see how its values are distributed.
> >> >
> >>
> >> Depends on where exactly the seq scan is - is it under the Hash? If yes,
> >> how come we even pick a hash join in this case? Hard to say without seeing
> >> the query plan.
> >
> >As noted earlier, I've attached a redacted plan here.
> >
> >The seq scan is under the parallel hash. Note also that, as hinted at
> >earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
> >set to 0.5, which has been a long-standing tweak on this cluster to work
> >around some otherwise painfully bad plans, though it really should be
> >targeted at the specific queries involved. With those reset, everything stays
> >the same except that the hash join turns into a nested loop, and
> >instead of a hash on the inner side of the join there's an index scan
> >(the join key is indexed on that table).
> >
>
> I wonder what the bad plans look like and why this fixes them, but it
> seems like a separate issue.

I am going to look into those further, but the cluster has been configured
this way for some time, so it's a bit messy to track down all of the
affected queries.

> >Because of the modified cpu_*_cost GUCs we have two ways around this
> >specific query plan: either reset those GUCs or set
> >enable_parallel_hash = off. But even if it means poor performance, it
> >seems to me that we wouldn't want to fail the query. I can confirm
> >that it is in fact capable of completing with this plan even with
> >work_mem = 150MB. It's slower...but "only" by 7-8x. That timing would
> >have been fairly unsurprising in this batch system, and is also much
> >closer to the "normal" runtime of the non-parallel plan we used to get
> >on 9.6. We were able to see this result on replicas while trying to
> >find out exactly how to reproduce the error (it seems the request was
> >sometimes just under the boundary needed to raise the error).
> >
>
> It's certainly true that completing a query is preferable to failing.
> It does depend on when we can identify the issue, though - the later we
> realize it, the harder it is to fix. If we notice it during
> planning, we can simply disable the hash join, which then forces using a
> different join method (and it may actually be quite fast). Once we start
> executing, it's way harder.

In this case, though, the failure seems fairly artificial. Aside from
the query being slow, there doesn't appear to be any real obstacle to it
completing. The box in question has 768GB of memory, so capping this
memory structure at 1GB seems arbitrary.

James
