Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: James Coleman <jtc331(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-09 13:23:23
Message-ID: CAAaqYe-NL4cCeaH4o121nBOWT_a5S2D-060qUqJYQp95k+wqZg@mail.gmail.com
Lists: pgsql-bugs

Just realized I accidentally replied only to Tomas.

On Sat, Nov 9, 2019 at 4:55 AM Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
> On Fri, Nov 08, 2019 at 09:10:13PM -0500, James Coleman wrote:
> >On Fri, Nov 8, 2019 at 8:12 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> >>
> >> On Sat, Nov 9, 2019 at 1:23 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
> >> > On Fri, Nov 8, 2019 at 6:30 PM Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> >> > > On Fri, Nov 08, 2019 at 09:52:16PM +0000, PG Bug reporting form wrote:
> >> > > >ERROR: invalid DSA memory alloc request size 1375731712
> >>
> >> > > >#3 0x000055a7c079bd17 in ExecParallelHashJoinSetUpBatches
> >> > > >(hashtable=hashtable(at)entry=0x55a7c1db2740, nbatch=nbatch(at)entry=2097152) at
> >>
> >> > > I've briefly looked at this today, and I think the root cause is
> >> > > somewhat similar to what is described in [1] where we simply increase
> >> > > the number of batches in an effort to keep batch contents in work_mem,
> >> > > but ignoring that each batch requires quite a bit of memory. So we end
> >> > > up with a lot of batches where each is small enough to fit into
> >> > > work_mem, but we need much more than work_mem to track the batches.
> >>
> >> Yeah. So even when this is fixed, the query is going to perform
> >> *terribly*, opening and closing millions of files in random order to
> >> stream tuples into, if this is a case where there really are tuples to
> >> go to all partitions (and not just a case of extreme skew that our
> >> extreme skew detector fails to detect because it only detects absolute
> >> extreme skew).
> >
> >work_mem in the repro case is 500MB (the original failure was at
> >150MB). I realize that's too small for this query, though it's also
> >worth knowing that if I get rid of some other cluster-wide tunings
> >that shouldn't have been cluster-wide originally (modifications to
> >cpu_*_cost), the seq scan on a TB+ table feeding the hash turns into
> >an index scan and no hash (and performs much better).
> >
>
> So is it a case of underestimate? I.e. does the Hash side expect much
> less data (rows) than it gets during execution?

I've attached a redacted plan, but the estimate with workers planned =
6 is 986,975,281, which seems quite reasonable to me, as the current
row count of the table is 5,917,539,491.

> >I think this also correlates with us seeing ~TB spike in disk usage,
> >so your explanation of the lots of "small" files would seem to be
> >consistent with that.
>
> That's consistent with the data. 500MB and nbatch=2097152 is exactly
> 1TB, and there'll be some additional overhead.

Good, that helps make sense of that piece of the puzzle.
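
For what it's worth, the failing request size seems to line up with
that math too: 1375731712 divides evenly by nbatch = 2097152, which
would put the shared per-batch state at roughly 656 bytes on this
build (that per-batch figure is just inferred from the two numbers in
the error, not read out of EstimateParallelHashJoinBatch). A quick
sanity check:

    -- 656 bytes/batch is an inference from the error message, not a
    -- value taken from the code
    SELECT 2097152 * 656 AS dsa_request_bytes;     -- 1375731712, as in the error
    SELECT pg_size_pretty(2097152::bigint * 656);  -- ~1.3 GB, past the ~1 GB alloc limit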

> >> > > This seems to be about the same problem, except that instead of
> >> > > forgetting about BufFile, the parallel hash join ignores this:
> >> > >
> >> > >     pstate->batches =
> >> > >         dsa_allocate0(hashtable->area,
> >> > >                       EstimateParallelHashJoinBatch(hashtable) * nbatch);
> >>
> >> Yeah, I failed to consider that possibility. I suppose it could be
> >> avoided with something like this (not tested, I will find a repro for
> >> this on Monday to convince myself that it's right):
> >>
> >> @@ -1246,7 +1246,10 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
> >>                         }
> >>
> >>                         /* Don't keep growing if it's not helping or we'd overflow. */
> >> -                       if (extreme_skew_detected || hashtable->nbatch >= INT_MAX / 2)
> >> +                       if (extreme_skew_detected ||
> >> +                           hashtable->nbatch >= INT_MAX / 2 ||
> >> +                           !AllocSizeIsValid(EstimateParallelHashJoinBatch(hashtable) *
> >> +                                             hashtable->nbatch * 2))
> >>                                 pstate->growth = PHJ_GROWTH_DISABLED;
> >>                         else if (space_exhausted)
> >>                                 pstate->growth = PHJ_GROWTH_NEED_MORE_BATCHES;
> >>
>
> Probably, but that's the execution part of it. I think we should also
> consider this in ExecChooseHashTableSize, and just not do PHJ when
> it exceeds work_mem from the very beginning.
>
> Once we start executing we probably can't do much better than what you
> proposed. In particular it doesn't make sense to limit the space by
> work_mem, unless we also tweak that limit because then the batch size
> increases arbitrarily.
>
> I think we need to do something about this in PG13 - both while planning
> (considering BufFile and SharedTuplestore), and during execution. The
> planner part seems fairly simple and independent, and I might have said
> before that I'll look into it.
>
> For the executor I think we've agreed the "proper" solution is BNL or
> something like that. Not sure how far we are from that, though; I
> don't recall any recent updates (but maybe I just missed that, the
> pgsql-hackers traffic is pretty insane). I wonder if we should get
> something like the "rebalancing" I proposed before, which is not a 100%
> fix but may at least reduce the negative impact.

Do you happen to have a link to those discussions? I'd be interested in
following along. I also can't say I know what BNL stands for.

> >> But James's query is still going to be terrible.
> >>
> >> Do you know if it's extreme skew (many tuples with the same key, just
> >> a few scattered around in other keys), or simply too much data for
> >> your work_mem setting?
> >
> >Given my description earlier (seq scan on a very large table), I
> >assume it's likely the latter? If you think that's sufficiently
> >likely, I'll leave it at that, or if not I could do a calculation on
> >that key to see how distributed it is.
> >
>
> Depends on where exactly the seq scan is - is it under the Hash? If yes,
> how come we even pick a hash join in this case? Hard to say without seeing
> the query plan.

As noted earlier, I've attached a redacted plan here.

The seq scan is under the parallel hash. Note also that, as hinted at
earlier, this happens with cpu_tuple_cost and cpu_operator_cost both
set to 0.5, a long-standing cluster-wide tweak meant to steer some
otherwise painfully bad plans, though it really should have been
targeted at specific queries. With those reset, everything stays the
same except that the hash join turns into a nested loop, and instead
of a hash on the inner side of the join there's an index scan (the
join key is indexed on that table).

Because of the modified cpu_*_cost gucs we have two ways around this
specific query plan: either reset those gucs or set
enable_parallel_hash = off. But even if it means poor performance, it
seems to me that we wouldn't want to fail the query. I can confirm
that it is in fact capable of completing with this plan even with
work_mem = 150MB. It's slower...but "only" by 7-8x. That timing
wouldn't have been particularly surprising in this batch system, and
it's also much closer to "normal" for the non-parallel plan we used to
get in 9.6. We were able to see this result on replicas while trying
to find out exactly how to reproduce the error (it seems it was
sometimes just under the threshold needed to raise the error).

James

Attachment Content-Type Size
error_plan_redacted.txt text/plain 4.4 KB
