Re: [HACKERS] Restricting maximum keep segments by repslots

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-07 07:30:43
Message-ID: 20200407.163043.2050717072576572791.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 07 Apr 2020 12:09:05 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > it seems to me that it suffices to check restart_lsn for being invalid
> > in the couple of places where the slot's owner advances (which is the
> > two auxiliary functions for ProcessStandbyReplyMessage). I have done so
> > in the attached. There are other places where the restart_lsn is set,
> > but those seem to be used only when the slot is created. I don't think
> > we need to cover for those, but I'm not 100% sure about that.
>
> StartLogicalReplcation does
> "XLogBeginRead(,MyReplicationSlot->data.restart_lsn)". If the
> restart_lsn is invalid, following call to XLogReadRecord runs into
> assertion failure. Walsender (or StartLogicalReplication) should
> correctly reject reconnection from the subscriber if restart_lsn is
> invalid.
>
> > However, the change in PhysicalConfirmReceivedLocation() breaks
> > the way slots work for pg_basebackup: apparently the slot is created
> > with a restart_lsn of Invalid and we only advance it the first time we
> > process a feedback message from pg_basebackup. I have a vague feeling
> > that that's bogus, but I'll have to look at the involved code a little
> > bit more closely to be sure about this.
>
> Mmm. Couldn't we have a new member 'invalidated' in ReplicationSlot?

I did that in the attached patch. The new "invalidated" flag is a
shared-but-not-saved member of a slot: it is initialized to false and
irreversibly set to true when the slot loses a required segment.

It is checked by the new function CheckReplicationSlotInvalidated() when
acquiring a slot and when updating a slot in response to a standby reply
message. This change stops the walsender without explicitly killing it,
but I didn't remove the existing code that does the kill.
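
To make the intended behavior concrete, here is a rough standalone
sketch of the idea. It is not the code in the attached patch: every
Toy* name is hypothetical, and it only models a flag that lives outside
the persisted slot data, is set once when a required segment is lost,
is checked before the slot is used, and is forgotten across a restart.

```c
/* Toy model only; all Toy* / toy_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t ToyLsn;

/* The part of a slot that would be written to disk. */
typedef struct ToySlotPersistent
{
    char        name[64];
    ToyLsn      restart_lsn;
} ToySlotPersistent;

/* In-memory slot: persisted data plus the shared-but-not-saved flag. */
typedef struct ToySlot
{
    ToySlotPersistent data;
    bool        invalidated;
} ToySlot;

static void
toy_slot_init(ToySlot *slot, const char *name, ToyLsn restart_lsn)
{
    assert(slot != NULL && name != NULL);
    memset(slot, 0, sizeof(*slot));     /* also NUL-terminates name */
    strncpy(slot->data.name, name, sizeof(slot->data.name) - 1);
    slot->data.restart_lsn = restart_lsn;
    slot->invalidated = false;
}

/*
 * Called when old WAL is removed. oldest_kept_lsn is the first LSN that
 * is still available. The flag is only ever set, never cleared, and
 * restart_lsn is left untouched.
 */
static void
toy_mark_if_lost(ToySlot *slot, ToyLsn oldest_kept_lsn)
{
    assert(slot != NULL);
    if (slot->data.restart_lsn < oldest_kept_lsn)
        slot->invalidated = true;
}

/* Check made before acquiring or advancing the slot. */
static bool
toy_slot_usable(const ToySlot *slot)
{
    assert(slot != NULL);
    return !slot->invalidated;
}

/* Simulated restart: only the persisted part survives. */
static void
toy_restart(ToySlot *slot)
{
    ToySlotPersistent saved;

    assert(slot != NULL);
    saved = slot->data;
    memset(slot, 0, sizeof(*slot));
    slot->data = saved;
    slot->invalidated = false;
}
```

In this model a slot with an unchanged restart_lsn is rejected by
toy_slot_usable() once it has been marked, and passes the check again
after toy_restart(), at which point any failure would have to come from
the missing segment itself.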

When a logical slot loses a segment, the publisher complains as follows:

[backend ] LOG: slot "s1" is invalidated at 0/370001C0 due to exceeding max_slot_wal_keep_size
[walsender] FATAL: terminating connection due to administrator command

The subscriber tries to reconnect and that fails as follows:

[19350] ERROR: replication slot "s1" is invalidated
[19352] ERROR: replication slot "s1" is invalidated
...

If the publisher restarts, that message is no longer seen and the
following appears instead:

[19372] ERROR: requested WAL segment 000000010000000000000037 has already been removed

Since the check is done in ReplicationSlotAcquire, some slot-related SQL
functions are affected as well.

=# select pg_replication_slot_advance('s1', '0/37000000');
ERROR: replication slot "s1" is invalidated

After restarting the publisher, the message changes in the same way as
for walsender.

=# select pg_replication_slot_advance('s1', '0/380001C0');
ERROR: requested WAL segment pg_wal/000000010000000000000037 has already been removed

Since I didn't touch restart_lsn at all, there is no fear of changing
other behavior inadvertently.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-further-change-type-2.patch text/x-patch 6.5 KB
