Re: Network failure may prevent promotion

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Network failure may prevent promotion
Date: 2024-01-18 08:26:31
Message-ID: 20240118.172631.1740094280436463079.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Sun, 31 Dec 2023 20:07:41 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> We've noticed that when walreceiver is waiting for a connection to
> complete, standby does not immediately respond to promotion
> requests. In PG14, upon receiving a promotion request, walreceiver
> terminates instantly, but in PG16, it waits for connection
> timeout. This behavior is attributed to commit 728f86fec65, where a
> part of libpqrcv_connect was simply replaced with a call to
> libpqsrv_connect_params. This behavior can be verified by simply
> dropping packets from the standby to the primary.
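
As an illustration of the difference described in the quoted text, here
is a minimal Python sketch of a connection wait that checks for a
pending request on every iteration instead of sleeping until the
connection timeout. It is not PostgreSQL code: connect_interruptibly,
should_stop and poll_interval are hypothetical names, and POSIX socket
semantics are assumed.

```python
import errno
import select
import socket
import time


def connect_interruptibly(host, port, should_stop, timeout=10.0, poll_interval=0.1):
    """Start a non-blocking TCP connect and wait for it in short slices.

    Returns (status, sock) where status is one of "connected",
    "interrupted", "timeout" or "failed"; sock is None unless connected.
    should_stop is a hypothetical callback standing in for "a promotion
    request is pending".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        return "failed", None

    deadline = time.monotonic() + timeout
    while True:
        # Check for a pending request before every wait, so that the
        # caller never has to sit out the full connection timeout.
        if should_stop():
            sock.close()
            return "interrupted", None
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            sock.close()
            return "timeout", None
        # A connecting socket becomes writable once the attempt finishes.
        _, writable, _ = select.select([], [sock], [], min(poll_interval, remaining))
        if writable:
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err == 0:
                sock.setblocking(True)
                return "connected", sock
            sock.close()
            return "failed", None
```

If outgoing packets are silently dropped, the socket never becomes
writable; a loop of this shape still returns "interrupted" within about
poll_interval seconds once should_stop() reports true, whereas a wait
that only watches the socket returns at the timeout.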

I apologize for the trouble, but I need to get this behavior fixed. To
continue this discussion, I'm attaching a repro script to this
message.

With the script, the standby is expected to promote immediately,
emitting the following log lines:

standby.log:
> 2024-01-18 16:25:22.245 JST [31849] LOG: received promote request
> 2024-01-18 16:25:22.245 JST [31850] FATAL: terminating walreceiver process due to administrator command
> 2024-01-18 16:25:22.246 JST [31849] LOG: redo is not required
> 2024-01-18 16:25:22.246 JST [31849] LOG: selected new timeline ID: 2
> 2024-01-18 16:25:22.274 JST [31849] LOG: archive recovery complete
> 2024-01-18 16:25:22.275 JST [31847] LOG: checkpoint starting: force
> 2024-01-18 16:25:22.277 JST [31846] LOG: database system is ready to accept connections
> 2024-01-18 16:25:22.280 JST [31847] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/1548E98, redo lsn=0/1548E40
> 2024-01-18 16:25:22.356 JST [31846] LOG: received immediate shutdown request
> 2024-01-18 16:25:22.361 JST [31846] LOG: database system is shut down

Since 728f86fec65 was introduced, the same operation no longer
completes promotion, as shown below. The patch attached to my previous
mail restores the old behavior shown above.

> 2024-01-18 16:47:53.314 JST [34515] LOG: received promote request
> 2024-01-18 16:48:03.947 JST [34512] LOG: received immediate shutdown request
> 2024-01-18 16:48:03.952 JST [34512] LOG: database system is shut down
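
The two outcomes can also be told apart mechanically. The following is
a small Python sketch, not part of the attached script, that scans log
text in the format shown in this mail. The function name
promotion_status is hypothetical, and treating "database system is
ready to accept connections" as the sign of a completed promotion is an
assumption based only on the log excerpts here.

```python
import re

# Matches the "[pid] LEVEL: message" tail of a log line such as
# "2024-01-18 16:25:22.245 JST [31849] LOG: received promote request".
LOG_TAIL = re.compile(r"\[(\d+)\] (\w+):\s+(.*)$")

# Assumed marker of a finished promotion, taken from the excerpts above.
READY_MESSAGE = "database system is ready to accept connections"


def promotion_status(log_text):
    """Classify a standby log as "promoted", "stuck" or "not requested"."""
    requested = False
    for line in log_text.splitlines():
        match = LOG_TAIL.search(line)
        if match is None:
            continue
        message = match.group(3)
        if message.startswith("received promote request"):
            requested = True
        elif requested and message.startswith(READY_MESSAGE):
            return "promoted"
    # A promote request with no "ready" line after it means the
    # promotion did not finish within the logged period.
    if requested:
        return "stuck"
    return "not requested"
```

Fed the first excerpt it reports "promoted"; fed the three lines logged
after 728f86fec65 it reports "stuck".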

The attached script requires that sudo can be executed. There is one
more point to note. The script attempts to establish a replication
connection to $primary_address:$primary_port. For the packet filter to
work, this must be a remote address that is reachable when no packet
filter is set up. The firewall-cmd settings need to be configured to
block this connection. If an unreachable IP address is simply
specified instead, the connection attempt fails immediately with a "No
route to host" error before the first packet is sent out, so it is not
blocked as intended.
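
To check which of the two cases a given address falls into before
running the test, a probe along the following lines may help. This is
a hedged Python sketch, separate from the attached script; probe and
its timeout parameter are hypothetical names. An attempt that the
network stack rejects outright (for example "No route to host" or
"Connection refused") returns quickly with an error, while an attempt
whose packets are silently dropped runs until the timeout.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Attempt a TCP connection and report (outcome, elapsed_seconds).

    outcome is "connected", "timed out", or "failed fast: <reason>".
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except socket.timeout:
        # No answer at all within the timeout, which is what a
        # packet-dropping filter looks like from the client side.
        outcome = "timed out"
    except OSError as exc:
        # The attempt was rejected before the timeout expired.
        outcome = "failed fast: {}".format(exc.strerror or exc)
    return outcome, time.monotonic() - start
```

For the repro to be meaningful, the address should give "connected"
while no filter is in place and "timed out" once the filter is active;
a "failed fast" result suggests the connection is being rejected rather
than blocked.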

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
promote_test.pl text/plain 1.4 KB
