| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Gregory Smith <gregsmithpgsql(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: PG18 GIN parallel index build crash - invalid memory alloc request size |
| Date: | 2025-10-26 21:52:47 |
| Message-ID: | 4acb6920-9270-4ee2-a6bf-a2c368947122@vondra.me |
| Lists: | pgsql-hackers |
On 10/26/25 16:16, Tomas Vondra wrote:
> On 10/24/25 22:22, Gregory Smith wrote:
>> On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> ...
>>> Can you show the contents of buffer and tup? I'm especially
>>> interested in these fields:
>>> buffer->nitems
>>> buffer->maxitems
>>> buffer->nfrozen
>>> tup->nitems
>>
>>
>> I'll see if I can grab that data at the crash point.
>>
>> FYI for anyone who wants to replicate this: if you have a system with
>> 128GB+ of RAM you could probably recreate the test case. Just have to
>> download the Planet file and run osm2pgsql with the overly tweaked
>> settings I use. I've published all the details of how I run this
>> regression test now.
>>
>> Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
>> Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
>> Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
>> Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
>>
>
> I did reproduce this using OSM, although with different settings, but
> that only affects loading. Setting maintenance_work_mem=20GB is more
> than enough to trigger the error during parallel index build.
>
> So I don't need the data.
>
>>
>>> If I'm right, I think there are two ways to fix this:
>>> (1) apply the trimming earlier, i.e. try to freeze + flush before
>>> actually merging the data (essentially, update nfrozen earlier)
>>> (2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
>>> Or we probably should do both.
>>
>>
>> Sounds like (2) is probably mandatory and (1) is good hygiene.
>>
>
> Yes, (2) is mandatory to fix this, and it's also sufficient. See the
> attached fix. I'll clean this up and push soon.
>
> AFAICS (1) is not really needed. I was concerned we might end up with
> each worker producing a TID buffer close to maintenance_work_mem, and
> then the leader would have to use twice as much memory when merging. But
> it turns out I already thought about that, and the workers use a fair
> share of maintenance_work_mem, not a new limit. So they produce smaller
> chunks, and those should not exceed maintenance_work_mem when merging.
>
> I tried "freezing" the existing buffer more eagerly (before merging the
> tuple), but that made no difference. The workers produce data with a lot
> of overlaps (simply because that's how the parallel builds divide data),
> and the amount of trimmed data is tiny. Something like 10k TIDs from a
> buffer of 1M TIDs. So a tiny difference, and it'd still fail.
>
> I'm not against maybe experimenting with this, but it's going to be a
> master-only thing, not for backpatching.
>
> Maybe we should split the data into smaller chunks while building tuples
> in ginFlushBuildState. That'd probably allow enforcing the memory limit
> more strictly, because we sometimes hold multiple copies of the TIDs
> arrays. But that's for master too.
>
I spoke too soon, apparently :-(

(2) is not actually a fix. It does fix some cases of the invalid alloc
size failures, but the following call to ginMergeItemPointers() can hit
the limit too, because it does a plain palloc() internally. I didn't
notice this before because of the other experimental changes, and
because it seems to depend on which of the OSM indexes is being built,
with how many workers, etc.
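
For reference, the request that now trips the limit is the destination
array sized for both inputs, i.e. something like this (simplified, from
memory, not a verbatim copy of ginpostinglist.c):

  dst = (ItemPointer) palloc((na + nb) * sizeof(ItemPointerData));

Once the two sorted TID arrays being merged add up to more than
MaxAllocSize / sizeof(ItemPointerData), i.e. roughly 179M items, the
plain palloc() fails with "invalid memory alloc request size" no matter
how the caller allocated its own buffer.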
I was a bit puzzled why we don't hit this with serial builds too, since
they call ginMergeItemPointers() as well. I guess that's just luck: with
serial builds we're likely flushing the TID list in smaller chunks,
appending to an existing tuple, and it seems unlikely any of those would
cross the alloc limit. But for parallel builds we're pretty much
guaranteed to see all TIDs for a key at once.
I see two ways to fix this:
a) Do the (re)palloc_huge change, but then also change the palloc call
in ginMergeItemPointers. I'm not sure if we want to change the existing
function, or create a static copy in gininsert.c with this tweak (it
doesn't need anything else, so it's not that bad).
b) Do the data splitting in ginFlushBuildState, so that workers don't
generate chunks larger than MaxAllocSize/nworkers (for any key). The
leader then merges at most one chunk per worker at a time, so it still
fits into the alloc limit.
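
To make the two options a bit more concrete, here are rough sketches of
both. These only illustrate the idea - they are not the attached patches,
and the helper names are made up.

For (a), a static copy in gininsert.c that differs from the original only
in allocating the destination array with the huge variant
(palloc_extended with MCXT_ALLOC_HUGE):

/*
 * Sketch only: same merge as ginMergeItemPointers(), except the result
 * array may exceed MaxAllocSize.
 */
static ItemPointer
ginMergeItemPointersHuge(ItemPointerData *a, uint32 na,
                         ItemPointerData *b, uint32 nb,
                         int *nmerged)
{
    ItemPointerData *dst;
    uint32      i = 0,
                j = 0,
                k = 0;

    dst = (ItemPointer) palloc_extended(((Size) na + nb) * sizeof(ItemPointerData),
                                        MCXT_ALLOC_HUGE);

    /* merge the two sorted TID arrays, keeping one copy of duplicates */
    while (i < na && j < nb)
    {
        int         cmp = ItemPointerCompare(&a[i], &b[j]);

        if (cmp < 0)
            dst[k++] = a[i++];
        else if (cmp > 0)
            dst[k++] = b[j++];
        else
        {
            dst[k++] = a[i++];
            j++;
        }
    }
    while (i < na)
        dst[k++] = a[i++];
    while (j < nb)
        dst[k++] = b[j++];

    *nmerged = (int) k;
    return dst;
}

For (b), the flushing in ginFlushBuildState would emit the TIDs for a key
in capped chunks rather than one huge chunk, along these lines:

/*
 * Sketch only: cap the number of TIDs per emitted tuple so that merging
 * one chunk from each of the nworkers participants stays below
 * MaxAllocSize in the leader.
 */
uint32      max_items;
uint32      off;

max_items = (MaxAllocSize / nworkers) / sizeof(ItemPointerData);

for (off = 0; off < nitems; off += max_items)
{
    uint32      len = Min(max_items, nitems - off);

    /*
     * Build and write one GinTuple for items[off .. off + len - 1];
     * flush_chunk() is a made-up name for whatever does that.
     */
    flush_chunk(buildstate, key, category, &items[off], len);
}

Either way the merge logic itself is unchanged; the difference is only
whether the >1GB allocation is allowed (a) or avoided (b).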
Both seem to work. I like (a) more, because it's more consistent with
how I understand m_w_m. It's weird to say "use up to 20GB of memory" and
then have the system override that with "1GB". I don't think it affects
performance, though.
I'll experiment with this a bit more; I just wanted to mention that the
fix I posted earlier does not actually fix the issue.
I also wonder how far we are from hitting the uint32 limits. AFAICS with
m_w_m=24GB we might end up with too many elements, even with serial
index builds. It'd have to be quite a weird data set, though.
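
For the record, the back-of-the-envelope math behind that, assuming the
TIDs accumulate as plain 6-byte ItemPointerData entries bounded by m_w_m:

  24GB / sizeof(ItemPointerData) = 25769803776 / 6 = 4294967296 = 2^32

so a uint32 item count would overflow right around m_w_m = 24GB, and
anything below that still fits (ignoring any per-key overhead).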
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| 0001-a-use-huge-allocations.patch | text/x-patch | 3.6 KB |
| 0001-b-split-TID-lists-when-flushing.patch | text/x-patch | 2.7 KB |