| From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Unexpected behavior after OOM errors |
| Date: | 2026-06-17 06:00:00 |
| Message-ID: | e77acaac-a1b3-40b3-99ee-5769b4e453e4@gmail.com |
| Lists: | pgsql-hackers |
Hello hackers,
I'd like to share my findings related to OOM error handling. I'm not sure
how large this class of anomalies is (or whether all of them can be
detected and fixed), but please look at a few issues I have discovered so
far:
1) An issue in lookup_type_cache()
The following modification:
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -104,6 +104,7 @@
#include "storage/spin.h"
#include "utils/memutils.h"
+#include "common/pg_prng.h"
/*
* Constants
@@ -528,7 +529,7 @@ hash_create(const char *tabname, int64 nelem, const HASHCTL *info, int flags)
* that this is the first allocation made with the alloc function. That's
* a little ugly, but works for now.
*/
- hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR), hashp->alloc_arg);
+ hashp->hctl = (pg_prng_double(&pg_global_prng_state) < 0.001) ? NULL : (HASHHDR *) hashp->alloc(sizeof(HASHHDR), hashp->alloc_arg);
if (!hashp->hctl)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -609,7 +610,7 @@ hash_create(const char *tabname, int64 nelem, const HASHCTL *info, int flags)
{
int temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
- if (!element_alloc(hashp, temp, i))
+ if ((pg_prng_double(&pg_global_prng_state) < 0.001) || !element_alloc(hashp, temp, i))
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
makes this script:
for i in {1..10000}; do
cat << 'EOS' | psql >>psql.log
SELECT 1 ORDER BY 1;
SELECT 1 ORDER BY 1;
EOS
grep "terminated by signal" server.log && break;
done
trigger an assertion failure:
2026-06-17 07:26:07.837 EEST [87325:3] [unknown] LOG: connection authorized: user=law database=regression application_name=psql
2026-06-17 07:26:07.837 EEST [87325:4] psql LOG: statement: SELECT 1 ORDER BY 1;
2026-06-17 07:26:07.837 EEST [87325:5] psql ERROR: out of memory at character 19
2026-06-17 07:26:07.837 EEST [87325:6] psql LOG: statement: SELECT 1 ORDER BY 1;
TRAP: failed Assert("TypeCacheHash != NULL && RelIdToTypeIdCacheHash != NULL"), File: "typcache.c", Line: 441, PID: 87325
ExceptionalCondition at assert.c:51:13
lookup_type_cache at typcache.c:444:27
get_sort_group_operators at parse_oper.c:207:13
addTargetToSortList at parse_clause.c:3647:4
transformSortClause at parse_clause.c:2959:14
transformSelectStmt at analyze.c:1806:18
transformStmt at analyze.c:396:15
transformOptionalSelectInto at analyze.c:327:1
transformTopLevelStmt at analyze.c:276:11
parse_analyze_fixedparams at analyze.c:144:10
pg_analyze_and_rewrite_fixedparams at postgres.c:699:10
exec_simple_query at postgres.c:1206:20
PostgresMain at postgres.c:4860:27
BackendInitialize at backend_startup.c:142:1
postmaster_child_launch at launch_backend.c:269:3
BackendStartup at postmaster.c:3627:8
ServerLoop at postmaster.c:1731:10
PostmasterMain at postmaster.c:1415:11
main at main.c:236:2
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7b3d5c02a28b]
postgres: law regression [local] SELECT(_start+0x25)[0x5944d79e8155]
2026-06-17 07:26:07.914 EEST [85875:6] LOG: client backend (PID 87325) was terminated by signal 6: Aborted
Without asserts enabled, the server might crash.
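
For readers who want to try the same fault-injection idea outside the server, here is a minimal standalone sketch. It is not PostgreSQL code: flaky_malloc and fail_rate are hypothetical names, and rand() is only a stand-in for pg_prng_double(). It mirrors the "(random value < threshold) ? NULL : alloc(...)" pattern from the patch above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical knob: fraction of allocations that should fail (0.0 .. 1.0). */
static double fail_rate = 0.001;

/*
 * Allocation wrapper that occasionally pretends to be out of memory.
 * rand() is an assumption made for this sketch; the patch above uses
 * pg_prng_double(&pg_global_prng_state) instead.
 */
void *
flaky_malloc(size_t size)
{
    double r;

    assert(fail_rate >= 0.0 && fail_rate <= 1.0);

    /* Uniform value in [0, 1). */
    r = (double) rand() / ((double) RAND_MAX + 1.0);

    if (r < fail_rate)
        return NULL;            /* simulated out-of-memory condition */

    return malloc(size);
}
```

Setting fail_rate to 1.0 makes every call fail and 0.0 disables the injection, which is handy when narrowing down which allocation site triggers a misbehavior.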
2) An issue in GetSnapshotData()
The following modification:
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -71,2 +71,3 @@
#include "utils/wait_event.h"
+#include "common/pg_prng.h"
@@ -2157,3 +2158,3 @@ GetSnapshotData(Snapshot snapshot)
Assert(snapshot->subxip == NULL);
- snapshot->subxip = (TransactionId *)
+ snapshot->subxip = (pg_prng_double(&pg_global_prng_state) < 0.01) ? NULL : (TransactionId *)
malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
makes this script (max_prepared_transactions = 2 in postgresql.conf):
for i in {1..1000}; do
cat << 'EOS' | psql >>psql.log
SELECT 1;
BEGIN;
CREATE TABLE t1(a int);
SAVEPOINT sp1;
INSERT INTO t1 VALUES (1);
ROLLBACK TO sp1;
INSERT INTO t1 VALUES (2);
PREPARE TRANSACTION 'pt1';
BEGIN;
CREATE TABLE t2(a int);
ROLLBACK;
ROLLBACK PREPARED 'pt1';
EOS
grep "terminated by signal" server.log && break;
done
trigger a segmentation fault:
2026-06-17 07:37:52.619 EEST [108789:3] [unknown] LOG: connection authorized: user=law database=regression application_name=psql
2026-06-17 07:37:52.620 EEST [108789:4] psql LOG: statement: SELECT 1;
2026-06-17 07:37:52.620 EEST [108789:5] psql ERROR: out of memory
2026-06-17 07:37:52.620 EEST [108789:6] psql LOG: statement: BEGIN;
2026-06-17 07:37:52.620 EEST [108789:7] psql LOG: statement: CREATE TABLE t1(a int);
2026-06-17 07:37:52.621 EEST [108789:8] psql LOG: statement: SAVEPOINT sp1;
2026-06-17 07:37:52.621 EEST [108789:9] psql LOG: statement: INSERT INTO t1 VALUES (1);
2026-06-17 07:37:52.621 EEST [108789:10] psql LOG: statement: ROLLBACK TO sp1;
2026-06-17 07:37:52.621 EEST [108789:11] psql LOG: statement: INSERT INTO t1 VALUES (2);
2026-06-17 07:37:52.621 EEST [108789:12] psql LOG: statement: PREPARE TRANSACTION 'pt1';
2026-06-17 07:37:52.622 EEST [108789:13] psql LOG: statement: BEGIN;
2026-06-17 07:37:52.622 EEST [108789:14] psql LOG: statement: CREATE TABLE t2(a int);
2026-06-17 07:37:52.777 EEST [108710:6] LOG: client backend (PID 108789) was terminated by signal 11: Segmentation fault
2026-06-17 07:37:52.777 EEST [108710:7] DETAIL: Failed process was running: CREATE TABLE t2(a int);
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memcpy_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:289
(gdb) bt
#0 __memcpy_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:289
#1 0x00005e857d7dc534 in GetSnapshotData (snapshot=0x5e857df07ae0 <CurrentSnapshotData>) at procarray.c:2297
#2 0x00005e857da9e452 in GetTransactionSnapshot () at snapmgr.c:331
#3 0x00005e857d823117 in PortalRunUtility (portal=0x5e85b337f3b0, pstmt=0x5e85b32fcd58, isTopLevel=true, setHoldSnapshot=false, dest=0x5e85b32fd118, qc=0x7ffe081d5aa0) at pquery.c:1127
#4 0x00005e857d82343b in PortalRunMulti (portal=0x5e85b337f3b0, isTopLevel=true, setHoldSnapshot=false, dest=0x5e85b32fd118, altdest=0x5e85b32fd118, qc=0x7ffe081d5aa0) at pquery.c:1307
#5 0x00005e857d82289a in PortalRun (portal=0x5e85b337f3b0, count=9223372036854775807, isTopLevel=true, dest=0x5e85b32fd118, altdest=0x5e85b32fd118, qc=0x7ffe081d5aa0) at pquery.c:784
(gdb) f 1
#1 0x00005e857d7dc534 in GetSnapshotData (snapshot=0x5e857df07ae0 <CurrentSnapshotData>) at procarray.c:2297
2297 memcpy(snapshot->subxip + subcount,
(gdb) p *snapshot
$1 = {snapshot_type = SNAPSHOT_MVCC, xmin = 745, xmax = 747, xip = 0x5e85b3329030, xcnt = 0, subxip = 0x0, subxcnt = 0,
suboverflowed = false, takenDuringRecovery = false, copied = false, curcid = 3, speculativeToken = 0, vistest = 0x0,
active_count = 0, regd_count = 0, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x0},
snapXactCompletionCount = 55}
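
The gdb output above shows xip allocated while subxip is NULL, which is consistent with a first-time setup that is guarded by one pointer only and was interrupted halfway by the earlier "out of memory" error. The following standalone sketch models that hazard in a simplified form. It is not the actual GetSnapshotData() code: Snap, test_alloc, fail_countdown, snap_init_fragile and snap_init_safe are hypothetical names, and a false return value stands in for ereport(ERROR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Snap
{
    unsigned   *xip;
    unsigned   *subxip;
} Snap;

/* Hypothetical switch: fail the Nth upcoming allocation; 0 means never. */
static int  fail_countdown = 0;

static void *
test_alloc(size_t size)
{
    if (fail_countdown > 0 && --fail_countdown == 0)
        return NULL;            /* simulated out-of-memory condition */
    return malloc(size);
}

/* Fragile variant: only xip decides whether first-time setup runs. */
bool
snap_init_fragile(Snap *s, size_t n)
{
    if (s->xip == NULL)
    {
        s->xip = test_alloc(n * sizeof(unsigned));
        if (s->xip == NULL)
            return false;

        assert(s->subxip == NULL);
        s->subxip = test_alloc(n * sizeof(unsigned));
        if (s->subxip == NULL)
            return false;       /* xip stays set, so setup is never retried */
    }
    return true;
}

/* One possible alternative: publish both pointers only after both succeed. */
bool
snap_init_safe(Snap *s, size_t n)
{
    if (s->xip == NULL)
    {
        unsigned   *xip = test_alloc(n * sizeof(unsigned));
        unsigned   *subxip;

        if (xip == NULL)
            return false;

        subxip = test_alloc(n * sizeof(unsigned));
        if (subxip == NULL)
        {
            free(xip);          /* leave *s untouched so a later call retries */
            return false;
        }

        s->xip = xip;
        s->subxip = subxip;
    }
    return true;
}
```

With the second allocation forced to fail, a repeated call to snap_init_fragile() reports success while subxip is still NULL, whereas snap_init_safe() redoes the whole setup. Whether this particular shape is appropriate for the server code is a separate question.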
3) An issue in StandbyAcquireAccessExclusiveLock()
No modification needed. Please try the attached TAP test on REL_17_STABLE.
It fails as follows:
t/099_out_of_shared_memory.pl .. Bailout called. Further testing stopped: pg_ctl start failed
099_out_of_shared_memory_standby.log contains:
2026-06-17 07:53:03.237 EEST [167771] LOG: database system is ready to accept read-only connections
2026-06-17 07:53:03.240 EEST [167775] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
2026-06-17 07:53:03.269 EEST [167774] FATAL: out of shared memory
2026-06-17 07:53:03.269 EEST [167774] HINT: You might need to increase "max_locks_per_transaction".
2026-06-17 07:53:03.269 EEST [167774] CONTEXT: WAL redo at 0/32218D8 for Standby/LOCK: xid 738 db 5 rel 17839
2026-06-17 07:53:03.269 EEST [167774] WARNING: you don't own a lock of type AccessExclusiveLock
2026-06-17 07:53:03.269 EEST [167774] LOG: RecoveryLockHash contains entry for lock no longer recorded by lock manager: xid 738 database 5 relation 17839
TRAP: failed Assert("false"), File: "standby.c", Line: 1053, PID: 167774
ExceptionalCondition at assert.c:52:13
StandbyReleaseXidEntryLocks at standby.c:1056:8
StandbyReleaseAllLocks at standby.c:1116:3
ShutdownRecoveryTransactionEnvironment at standby.c:178:2
StartupProcExit at startup.c:208:1
shmem_exit at ipc.c:282:9
proc_exit_prepare at ipc.c:201:2
proc_exit at ipc.c:155:2
errfinish at elog.c:593:5
LockAcquireExtended at lock.c:1020:4
LockAcquire at lock.c:763:1
StandbyAcquireAccessExclusiveLock at standby.c:1026:10
standby_redo at standby.c:1175:35
ApplyWalRecord at xlogrecovery.c:2008:13
PerformWalRecovery at xlogrecovery.c:1835:8
StartupXLOG at xlog.c:5803:24
StartupProcessMain at startup.c:264:2
postmaster_child_launch at launch_backend.c:281:9
StartChildProcess at postmaster.c:3918:8
PostmasterMain at postmaster.c:1369:13
startup_hacks at main.c:219:1
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x71abb4e2a28b]
postgres: standby: startup recovering 000000010000000000000003(_start+0x25)[0x64d0286de095]
2026-06-17 07:53:03.279 EEST [167771] LOG: startup process (PID 167774) was terminated by signal 6: Aborted
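
The log and backtrace above are consistent with a bookkeeping entry being recorded before the lock acquisition that then fails, so that exit-time cleanup finds a tracked lock the lock manager never granted. The following standalone sketch models only that ordering hazard. It is not the standby.c code: tracked[], held[], fail_acquire and all function names are hypothetical stand-ins, and a false return value stands in for the FATAL error.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCKS 4

static bool tracked[MAX_LOCKS];     /* stand-in for the recovery lock table */
static bool held[MAX_LOCKS];        /* stand-in for the lock manager state */
static bool fail_acquire = false;   /* hypothetical fault-injection switch */

static bool
lock_manager_acquire(int id)
{
    if (fail_acquire)
        return false;               /* simulated "out of shared memory" */
    held[id] = true;
    return true;
}

/* Fragile ordering: the entry is recorded before the lock is acquired. */
bool
acquire_record_first(int id)
{
    tracked[id] = true;
    return lock_manager_acquire(id);    /* on failure the entry remains */
}

/* Alternative ordering: record the entry only after acquisition succeeds. */
bool
acquire_record_after(int id)
{
    if (!lock_manager_acquire(id))
        return false;
    tracked[id] = true;
    return true;
}

/* Number of tracked entries that have no matching held lock. */
int
count_orphans(void)
{
    int         n = 0;

    for (int i = 0; i < MAX_LOCKS; i++)
    {
        if (tracked[i] && !held[i])
            n++;
    }
    return n;
}
```

In this model a failed acquire_record_first() leaves one orphaned entry behind, while a failed acquire_record_after() leaves none. Recording afterwards has its own failure window in real code (the bookkeeping step itself may fail once the lock is held), so this is an illustration of the symptom rather than a proposed fix.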
Best regards,
Alexander
| Attachment | Content-Type | Size |
|---|---|---|
| 099_out_of_shared_memory.pl | application/x-perl | 904 bytes |