ReplicationSlotRelease may set the statusFlags of other processes in PG14

From: feichanghong <feichanghong(at)qq(dot)com>
To: pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Cc: andres <andres(at)anarazel(dot)de>, sawada(dot)mshk <sawada(dot)mshk(at)gmail(dot)com>, horikyota(dot)ntt <horikyota(dot)ntt(at)gmail(dot)com>
Subject: ReplicationSlotRelease may set the statusFlags of other processes in PG14
Date: 2024-03-16 14:29:03
Message-ID: tencent_CA7420C82971BC4B64F0748A7D2898A4720A@qq.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hello hackers,

I encountered an issue where an assertion failure occurred for
ProcGlobal-&gt;statusFlags. The detailed analysis is as follows:

# Issue background

PostgreSQL version: REL_14_STABLE v14.11 3621ffd9f2

The assert failure occurs when the autovacuum worker process commits a
transaction: proc-&gt;statusFlags is 3, and ProcGlobal-&gt;statusFlags[proc-&gt;pgxactoff]
is 0. The specific stack trace is as follows:
```
#0&nbsp; 0x00007f174c401277 in raise () from /lib64/libc.so.6
#1&nbsp; 0x00007f174c402968 in abort () from /lib64/libc.so.6
#2&nbsp; 0x00000000010f1403 in ExceptionalCondition (conditionName=0x1728760 "proc-&gt;statusFlags == ProcGlobal-&gt;statusFlags[proc-&gt;pgxactoff]", errorType=0x1728420 "FailedAssertion", fileName=0x17284b8 "procarray.c", lineNumber=776) at assert.c:44
#3&nbsp; 0x0000000000e997d0 in ProcArrayEndTransaction (proc=0x7f1731630f00, latestXid=0) at procarray.c:776
#4&nbsp; 0x00000000008cfb18 in CommitTransaction () at xact.c:2447
#5&nbsp; 0x00000000008d090e in CommitTransactionCommand () at xact.c:3293
#6&nbsp; 0x0000000000bb5652 in vacuum_rel (relid=2608, relation=0x56f76f8, params=0x56e399c) at vacuum.c:2174
#7&nbsp; 0x0000000000bb300e in vacuum (relations=0x5535b40, params=0x56e399c, bstrategy=0x56e3a30, isTopLevel=true) at vacuum.c:486
#8&nbsp; 0x0000000000dc01a9 in autovacuum_do_vac_analyze (tab=0x56e3998, bstrategy=0x56e3a30) at autovacuum.c:3467
#9&nbsp; 0x0000000000dbeb94 in do_autovacuum () at autovacuum.c:2632
#10 0x0000000000dbd530 in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1769
#11 0x0000000000dbd12a in StartAutoVacWorker () at autovacuum.c:1545
#12 0x0000000000dd894f in StartAutovacuumWorker () at postmaster.c:6441
#13 0x0000000000dd8189 in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:6108
#14 <signal handler called&gt;
#15 0x00007f174c4bffc3 in __select_nocancel () from /lib64/libc.so.6
#16 0x0000000000dd1d67 in ServerLoop () at postmaster.c:1881
#17 0x0000000000dd16d9 in PostmasterMain (argc=3, argv=0x533e060) at postmaster.c:1582
#18 0x0000000000ca4af5 in main (argc=3, argv=0x533e060) at main.c:216
```

# Analysis and Reproduction
## Analysis

A process utilizing replication slots (usually walsender) calls callback
functions in the order of RemoveProcFromArray-&gt;ProcKill upon abnormal exit.
Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
ProcKill then attempts to set ProcGlobal-&gt;statusFlags[MyProc-&gt;pgxactoff] again
via ReplicationSlotRelease. By this time, the flag may already be assigned to
another process.

## Reproduction

This issue only occurs in PG14. In versions prior to PG14, ProcGlobal-&gt;statusFlags
had not been refactored. In versions following PG14, it has been fixed by commit
2f6501f (discussed in [1]). Therefore, I've cc'd Masahiko, Kyotaro, and Andres in
hopes that they can provide some insights.

To replicate the issue, execute the following steps:
1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
&nbsp; &nbsp;where pgxactoff is set to an invalid value in ProcArrayRemove, and some
&nbsp; &nbsp;checks are added.
2. Use the SQL below to terminate the walsender process.
```
select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
```

# Fix

To fix the issue, I have provided some patches in the attachment:
1. Backpatching 2f6501f into the PG14 version will fix the problem.
2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
&nbsp; &nbsp;checks should be done when setting ProcGlobal-&gt;statusFlags.

[1] https://www.postgresql.org/message-id/29273AF3-9684-4069-9257-D05090B03B99@amazon.com

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

&nbsp;

Attachment Content-Type Size
v1-0000-head-invalidate-pgxactoff-after-remove-pgproc.patch application/octet-stream 5.4 KB
v1-0000-v14-backpatch-2f6501f.patch application/octet-stream 5.2 KB
v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch application/octet-stream 4.9 KB
v1-0000-v15-v16-invalidate-pgxactoff-after-remove-pgproc.patch application/octet-stream 5.4 KB

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Wolfgang Walther 2024-03-16 15:55:49 Re: Regression tests fail with musl libc because libpq.so can't be loaded
Previous Message Tom Lane 2024-03-16 14:24:43 Re: Regression tests fail with musl libc because libpq.so can't be loaded