Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: James Coleman <jtc331(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash
Date: 2019-11-10 17:29:25
Message-ID: 20191110172925.mlkglpwzwego3wtp@development
Lists: pgsql-bugs

On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote:
>On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>>
>> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331(at)gmail(dot)com> wrote:
>> > So I should have run the earlier attached plan with VERBOSE, but
>> > here's the interesting thing: the parallel hash node's seq scan node
>> > outputs two columns: let's call them (from the redacted plan)
>> > items.system_id and items.account_id. The first (system_id) is both
>> > not null and unique; the second (account_id) definitely has massive
>> > skew. I'm not very up-to-speed on how the hash building works, but I
>> > would have (perhaps naïvely?) assumed that the first column being
>> > unique would make the hash keys very likely not to collide in any
>> > significantly skewed way. Am I missing something here?
>>
>> Hrm. So the compound key is unique then. I was assuming up until now
>> that it had duplicates. The hashes of the individual keys are
>> combined (see ExecHashGetHashValue()), so assuming there is nothing
>> funky about the way citext gets hashed (and it's not jumping out at
>> me), your unique keys should give you uniform hash values and thus
>> partition size, and repartitioning should be an effective way of
>> reducing hash table size. So now it sounds like you have a simple
>> case of underestimation, but now I'm confused about how you got a
>> 344MB hash table with work_mem = 150MB:
>

Why would the compound key matter? The join only does this

Hash Cond: (group_details.group_number = items.system_id)

so the only thing that really matters when it comes to skew is
system_id, i.e. the citext column. The other column is used by
aggregation, but that's irrelevant here, no?

>Looking at the source, citext's hash is a pretty standard call to
>hash_any, so I don't see how that would lead to any oddities (barring
>an intentional hash collision etc., but this is real data).
>
>Do you have any theories about where the underestimation is happening?
>It knows the number of rows reasonably well. The one thing I'm not
>sure about yet is the row width = 16. The account_id is a bigint, so
>we can assume it knows the size properly there. The system_id being
>citext...I think the 8 bytes it has for that is probably a bit low on
>average, but I'm not sure yet by how much (I'm going to run a query to
>find out). Hmm. Query came back, and average length is just shy of
>7.9...so the 16 byte row size is looking pretty good. So I'm at a loss
>of why/what it would be underestimating (does it know about
>uniqueness/non-uniqueness? could that be a factor?)?
>

I think the easiest thing we can do is to run a couple of queries
collecting useful stats, like:

-- the most frequent system_id values
select system_id, count(*), sum(length(system_id)) from items
group by system_id
order by count(*) desc limit 100;

-- the largest group by system_id
select system_id, count(*), sum(length(system_id)) from items
group by system_id
order by sum(length(system_id)) desc limit 100;

That should tell us if there's something off.

>I also don't know why it seems to regularly fail on the primary, but
>not on the sync, unless we adjust the work_mem up. I've double-checked
>all GUCs and the only differences are things related to replication
>that you'd expect to be different on primary/replica.
>
>I know there are some things that execute differently on replicas, so
>I assume it's something like that, but I don't know what specifically
>would cause this here.
>

Hmm, I hadn't realized it fails on the primary more often. That's
certainly strange; I don't have a good idea why that could be happening.
Do the query plans look the same? What does the EXPLAIN ANALYZE output
look like?

>> Buckets: 4194304 (originally 4194304) Batches: 32768 (originally
>> 4096) Memory Usage: 344448kB
>>
>> And I'm confused about what was different when it wanted the crazy
>> number of batches.
>
>I'm not quite sure how to find out; if you have any ideas, I'd love to
>hear them. The one thing I can think of to try is to slowly increase
>work_mem (since setting it to 500MB reproduced the error on the
>replica) and see if the bucket info starts to trend up.
>

Chances are it actually detected skew and disabled further batch growth,
hence the hash table grew above work_mem. We should probably add info
about this (skew, disabled growth) to EXPLAIN ANALYZE verbose mode.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
