Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: alex(at)altmetric(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached
Date: 2021-12-10 08:41:56
Message-ID: CAD21AoAQtdXm_ijArUew69X178fWp5HoTF8+CQksWuxZSCeHWg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Wed, Dec 8, 2021 at 9:58 PM PG Bug reporting form
<noreply(at)postgresql(dot)org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference: 17327
> Logged by: Alex E
> Email address: alex(at)altmetric(dot)com
> PostgreSQL version: 13.5
> Operating system: Ubuntu 18.04
> Description:
>
> We have recently run into a situation where our pg_basebackup-based backups
> started failing unexpectedly. These use WAL streaming to keep up with
> changes (which uses a temporary replication slot server side).
>
> The only errors logged on the client side were as listed below:
>
> pg_basebackup: error: could not receive data from WAL stream: SSL connection
> has been closed unexpectedly
> pg_basebackup: error: could not read COPY data: server closed the connection
> unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> pg_basebackup: removing contents of data directory "/backups/some/path/"
>
> whilst on the server side we only got:
>
> 2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601 to
> release replication slot "pg_basebackup_42601"
> 2021-12-03 16:21:54 UTC [42601-1] replicator(at)[unknown] FATAL: terminating
> connection due to administrator command
> 2021-12-03 16:21:54 UTC [42601-2] replicator(at)[unknown] STATEMENT:
> START_REPLICATION SLOT "pg_basebackup_42601" 4721F/45000000 TIMELINE 3
>
> The above was very unhelpful as it made us believe we might be dealing with
> either a network interruption or another type of mysterious hardware
> error.
>
> We then proceeded to try several things to try and determine the root cause
> of the problem and eventually realized (by trial and error and monitoring
> various statistics) that we were breaching our max_slot_wal_keep_size limit
> for the temporary replication slot whilst taking the pg_basebackup. The only
> way we realized this was by using a permanent physical replication slot to
> take the backup instead of a temporary one, and when doing this a relevant
> error related to max_slot_wal_keep_size being breached was issued.
>
> The core issue here then in our opinion is that Postgres server should log
> an error when the max_slot_wal_keep_size limit is reached for temporary
> replication slots as well as for permanent ones as otherwise
> users/administrators are presented only with non-descript connection
> termination errors which do not point to the actual cause of the problem.

IIUC the LOG message you got, "terminating process 42601 to release
replication slot "pg_basebackup_42601"", is the message you want. That
is, it's emitted when the replication slot reached
max_slot_wal_keep_size and was invalidated. I don't think such
behavior varies depending on persistent slots and temporary slots. If
the message "terminating process XXXX to release replication slot
"YYY"" is not clear to indicate that the replication slot is
invalidated, probably we can improve the message or add details.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message James Pang (chaolpan) 2021-12-10 09:05:19 RE: BUG #17326: Postgres crashed when pg_reload_conf() with ssl certificate parameters
Previous Message Masahiko Sawada 2021-12-10 06:06:11 Re: BUG #17296: replication slot self-removed after created