Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records

From: Dmitriy Kuzmin <kuzmin(dot)db4(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Startup process on a hot standby crashes with an error "invalid memory alloc request size 1073741824" while replaying "Standby/LOCK" records
Date: 2022-09-05 10:19:58
Message-ID: CAHLDt=_ts0A7Agn=hCpUh+RCFkxd+G6uuT=kcTfqFtGur0dp=A@mail.gmail.com
Lists: pgsql-bugs pgsql-hackers

Greetings!

One of our clients experienced a crash of the startup process with the error
"invalid memory alloc request size 1073741824" on a hot standby, which
ended in reinitializing the replica.

According to the logs, the startup process crashed while trying to replay a
"Standby/LOCK" record with a huge list of locks (see the attached
replicalog_tail.tgz):

FATAL: XX000: invalid memory alloc request size 1073741824
CONTEXT: WAL redo at 7/327F9248 for Standby/LOCK: xid 1638575 db 7550635
rel 8500880 xid 1638575 db 7550635 rel 10324499...
LOCATION: repalloc, mcxt.c:1075
BACKTRACE:
postgres: startup recovering
000000010000000700000033(repalloc+0x61) [0x8d7611]
postgres: startup recovering 000000010000000700000033() [0x691c29]
postgres: startup recovering 000000010000000700000033() [0x691c74]
postgres: startup recovering 000000010000000700000033(lappend+0x16)
[0x691e76]
postgres: startup recovering
000000010000000700000033(StandbyAcquireAccessExclusiveLock+0xdd) [0x7786bd]
postgres: startup recovering
000000010000000700000033(standby_redo+0x5d) [0x7789ed]
postgres: startup recovering
000000010000000700000033(StartupXLOG+0x1055) [0x51d7c5]
postgres: startup recovering
000000010000000700000033(StartupProcessMain+0xcd) [0x71d65d]
postgres: startup recovering
000000010000000700000033(AuxiliaryProcessMain+0x40c) [0x52c7cc]
postgres: startup recovering 000000010000000700000033() [0x71a62e]
postgres: startup recovering
000000010000000700000033(PostmasterMain+0xe74) [0x71cf74]
postgres: startup recovering 000000010000000700000033(main+0x70d)
[0x4891ad]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f0be2db3555]
LOG: 00000: startup process (PID 2650) exited with exit code 1

It looks like the startup process at some point hits the MaxAllocSize
limit (memutils.h), which forbids allocation requests larger than 1 GB - 1 bytes.

Judging by pg_waldump output, there was a long-running transaction on the
primary that sequentially locked and modified a lot of tables. Right
before the crash there were about 85k exclusively locked tables.
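The lock-list growth can be seen directly in the WAL. A small sketch (the segment name is a placeholder; the demo line is the record from the CONTEXT above) that counts how many db/rel entries each Standby/LOCK record carries:

```shell
# Count lock entries per Standby/LOCK record. Real usage:
#   pg_waldump --rmgr=Standby pg_wal/<segment> | sh count_locks.sh
count_locks() {
  # each held lock appears as "... db NNN rel NNN" in the record
  grep 'Standby/LOCK' | awk '{ n = gsub(/ rel /, "&"); print NR ": " n " locks" }'
}

# demo on the record quoted in the CONTEXT line above
printf '%s\n' 'WAL redo at 7/327F9248 for Standby/LOCK: xid 1638575 db 7550635 rel 8500880 xid 1638575 db 7550635 rel 10324499' \
  | count_locks
```

On our WAL the per-record count keeps matching the full set of locks held by the long transaction, which is what makes the records so large.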

While trying to reproduce the issue, I found out that the problem is not so
much the number of locks as the duration of the transaction. While replaying
"Standby/LOCK" records, the startup process eventually crashes with the
mentioned error if a long transaction holds a large number of locks long
enough.

I managed to reproduce the situation on 13.7, 14.4, 15beta3 and master
using the following steps:
1) get primary and replica with the following settings:
max_locks_per_transaction = '10000' and max_connections = '1000'
2) create 950k tables
3) lock them in AccessExclusive mode in transaction and leave it in "idle
in transaction state"
4) make some side activity with pgbench (pgbench -R 100 -P 5 -T 7200 -c 1)
In about 20-30 minutes startup process crashes with the same error.
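Steps 2 and 3 can be scripted along these lines (this is my sketch, not the attached 950k_locks.sql; table names t_1..t_N are arbitrary):

```shell
# Generate the table-creation and locking scripts for the reproduction.
N=950000
seq 1 "$N" | awk '{ printf "CREATE TABLE t_%d ();\n", $1 }' > create_tables.sql
{
  echo 'BEGIN;'
  seq 1 "$N" | awk '{ printf "LOCK TABLE t_%d IN ACCESS EXCLUSIVE MODE;\n", $1 }'
  # deliberately no COMMIT: the transaction must stay open
} > lock_tables.sql
wc -l create_tables.sql lock_tables.sql
```

Run lock_tables.sql from an interactive psql session and leave that session open, so it stays "idle in transaction" and keeps holding all the locks.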

As far as I understand, there is a fixed number of AccessExclusive locks in
this scenario: 950k exclusive locks acquired by the "long running" transaction
and no additional exclusive locks held by pgbench. But the startup process
consumes more and more memory while replaying records that contain exactly
the same list of locks. Could it be a memory leak? If not, is there any way
to improve this behavior?

If you're going to reproduce it, set up a primary and a replica with enough
RAM and simultaneously run on the primary:
$ pgbench -i && pgbench -R 100 -P 5 -T 7200 -c 1 (in one terminal)
$ psql -f 950k_locks.sql (in another terminal)
and observe the startup process's memory usage and the replica's logs.
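For watching the memory usage, I read VmRSS from /proc (Linux only; the startup PID placeholder is whatever `ps` shows for "postgres: startup recovering" on the replica):

```shell
# Print a process's resident set size in kB from /proc/<pid>/status.
rss_kb() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# In practice, sample the startup process periodically:
#   while sleep 5; do rss_kb <startup_pid>; done
# Demo on the current shell's own PID:
rss_kb $$
```

In my runs this number grows steadily on the replica until the allocation failure.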

Best regards,
Dmitry Kuzmin

Attachment Content-Type Size
950k_locks.sql application/sql 706 bytes
replicalog_tail.tgz application/x-compressed-tar 4.2 MB
