| From: | Tomas Vondra <tomas(at)vondra(dot)me> |
|---|---|
| To: | Gregory Smith <gregsmithpgsql(at)gmail(dot)com> |
| Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
| Subject: | Re: PG18 GIN parallel index build crash - invalid memory alloc request size |
| Date: | 2025-10-26 15:16:07 |
| Message-ID: | d28e1299-d8d7-4c05-9a64-cc412f57026b@vondra.me |
| Lists: | pgsql-hackers |
On 10/24/25 22:22, Gregory Smith wrote:
> On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
> in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
> makes the issue go away?
>
>
> The index builds at up to 4GB of m_w_m. 5GB and above crashes.
>
> Now that I know roughly where the limits are that de-escalates things a
> bit. The sort of customers deploying a month after release should be OK
> with just knowing to be careful about high m_w_m settings on PG18 until
> a fix is ready.
>
> Hope everyone is enjoying Latvia. My obscure music collection includes
> a band from there I used to see in the NYC area, The Quags:
> https://www.youtube.com/watch?v=Bg3P4736CxM
>
Nice!
> Can you show the contents of buffer and tup? I'm especially interested
> in these fields:
> buffer->nitems
> buffer->maxitems
> buffer->nfrozen
> tup->nitems
>
>
> I'll see if I can grab that data at the crash point.
>
> FYI for anyone who wants to replicate this: if you have a system with
> 128GB+ of RAM you could probably recreate the test case. Just have to
> download the Planet file and run osm2pgsql with the overly tweaked
> settings I use. I've published all the details of how I run this
> regression test now.
>
> Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
> Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
> Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
> Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
>
I did reproduce this using OSM, although with different settings, but
those only affect loading. Setting maintenance_work_mem=20GB is more
than enough to trigger the error during a parallel index build.
So I don't need the data.
>
> If I'm right, I think there are two ways to fix this:
> (1) apply the trimming earlier, i.e. try to freeze + flush before
> actually merging the data (essentially, update nfrozen earlier)
> (2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
> Or we probably should do both.
>
>
> Sounds like (2) is probably mandatory and (1) is good hygiene.
>
Yes, (2) is mandatory to fix this, and it's also sufficient. See the
attached fix. I'll clean this up and push soon.
AFAICS (1) is not really needed. I was concerned we might end up with
each worker producing a TID buffer close to maintenance_work_mem, and
then the leader would have to use twice as much memory when merging. But
it turns out I already thought about that: the workers use a fair
share of maintenance_work_mem, not a separate per-worker limit. So they
produce smaller chunks, and those should not exceed maintenance_work_mem
when merging.
I tried "freezing" the existing buffer more eagerly (before merging the
tuple), but that made no difference. The workers produce data with a lot
of overlaps (simply because that's how parallel builds divide the data),
so the amount of trimmed data is tiny: something like 10k TIDs from a
buffer of 1M TIDs. A tiny difference, and the build would still fail.
I'm not against experimenting with this, but it's going to be a
master-only thing, not for backpatching.
Maybe we should split the data into smaller chunks while building tuples
in ginFlushBuildState. That'd probably allow enforcing the memory limit
more strictly, because we sometimes hold multiple copies of the TID
arrays. But that's for master too.
regards
--
Tomas Vondra
| Attachment | Content-Type | Size |
|---|---|---|
| gin-palloc-fix.patch | text/x-patch | 848 bytes |