Re: WAL segments removed from primary despite the fact that logical replication slot needs it.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: depesz(at)depesz(dot)com
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Date: 2022-11-15 02:20:19
Message-ID: CAD21AoDHj2RTEjVdqNTVPKyXqjjyUXpyA+=o8UMFEN-vTU848g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Nov 11, 2022 at 11:50 PM hubert depesz lubaczewski
<depesz(at)depesz(dot)com> wrote:
>
> 2022-11-10 20:59:55 UTC 103A7/7B666388 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/8DECBA18
> 2022-11-10 21:00:54 UTC 103A7/88225C28 focal14 pgoutput logical 16607 canvas f t 10414 \N 3241434855 1039D/83825958 1039D/911A8DB0
> 2022-11-10 21:01:53 UTC 103A7/9E77EF90 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> 2022-11-10 21:02:52 UTC 103A7/B45ECEA8 focal14 pgoutput logical 16607 canvas f f \N \N 3241434855 1039D/83825958 1039D/911A8DB0
> 2022-11-10 21:03:52 UTC 103A7/C4355F60 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/955633D0
> 2022-11-10 21:04:51 UTC 103A7/D5A6FDD8 focal14 pgoutput logical 16607 canvas f t 21748 \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:05:50 UTC 103A7/E2C7F590 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:06:49 UTC 103A7/F28B2000 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:07:48 UTC 103A8/3828D20 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38
> 2022-11-10 21:08:47 UTC 103A8/15BBF088 focal14 pgoutput logical 16607 canvas f f \N \N 3241443528 1039D/83825958 1039D/96453F38

It happened twice that the walsender process was active for at least
one minute while not advancing restart_lsn and exited. Could you share
the server logs from around 2022-11-10 20:59:55 to 21:08:47 so that we
can find out what the wal sender processes were doing?

Also, you may have logs for "failed to increase restart lsn: proposed
%X/%X, after %X/%X, current candidate %X/%X, current after %X/%X,
flushed up to %X/%X" and "got new restart lsn %X/%X at %X/%X". If so,
please share it as well.

I've tried to reproduce this issue with a similar setting in a loop
but could not.

Given that you have the log "attempting to remove WAL segments older
than log file 000000000001039D0000008A", the last WAL file we need to
retain calculated by KeepLogSeg() was wrong.

KeepLogSeg() retreated the last segment that we need to retain while
considering replication slots' minimum LSN (and wal_keep_segments) but
it seemed not to work fine. The possibilities are (1) we got the wrong
slots' minimum LSN that is newer than a slot's LSN (ie, retreated
logSegNo but it's still newer than required WAL files) or (2) we got
an invalid slots's minimum LSN (ie, didn't retreated at all). If we
know what the RedoRecPtr was, we can calculate _logSegNo that has not
been retreated by KeepLogSeg().

BTW in LogicalConfirmReceivedLocation(), we set restart_lsn as follows:

if (MyReplicationSlot->candidate_restart_valid != InvalidXLogRecPtr &&
MyReplicationSlot->candidate_restart_valid <= lsn)
{
Assert(MyReplicationSlot->candidate_restart_lsn !=
InvalidXLogRecPtr);

MyReplicationSlot->data.restart_lsn =
MyReplicationSlot->candidate_restart_lsn;
MyReplicationSlot->candidate_restart_lsn = InvalidXLogRecPtr;
MyReplicationSlot->candidate_restart_valid = InvalidXLogRecPtr;
updated_restart = true;
}

Is it worth changing the assertion to elog(ERROR) so that we can
detect that an invalid value is set to restart_lsn?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2022-11-15 04:51:18 Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Previous Message hubert depesz lubaczewski 2022-11-14 16:51:40 Re: WAL segments removed from primary despite the fact that logical replication slot needs it.