Re: [HACKERS] Restricting maximum keep segments by repslots

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: alvherre(at)2ndquadrant(dot)com
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-28 04:58:57
Message-ID: 20200428.135857.1243584941650464602.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Mon, 27 Apr 2020 18:33:42 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in
> On 2020-Apr-08, Kyotaro Horiguchi wrote:
>
> > At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> >
> > Just avoiding starting replication when restart_lsn is invalid is
> > sufficient (the attached, which is equivalent to a part of what the
> > invalidated flag did). I thing that the error message needs a Hint but
> > it looks on the subscriber side as:
> >
> > [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated
> > HINT: The slot exceeds the limit by max_slot_wal_keep_size.
> >
> > I don't think it is not clean.. Perhaps the subscriber should remove
> > the trailing line of the message from the publisher?
>
> Thanks for the fix! I propose two changes:
>
> 1. reword the error like this:
>
> ERROR: replication slot "regression_slot3" cannot be advanced
> DETAIL: This slot has never previously reserved WAL, or has been invalidated

Agreed to describe what is failed rather than the cause. However,
logical replications slots are always "previously reserved" at
creation.

> 2. use the same error in one other place, to wit
> pg_logical_slot_get_changes() and pg_replication_slot_advance(). I
> made the DETAIL part the same in all places, but the ERROR line is
> adjusted to what each callsite is doing.
> I do think that this change in test_decoding is a bit unpleasant:
>
> -ERROR: cannot use physical replication slot for logical decoding
> +ERROR: cannot get changes from replication slot "repl"
>
> The test is
> -- check that we're detecting a streaming rep slot used for logical decoding
> SELECT 'init' FROM pg_create_physical_replication_slot('repl');
> SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

The message may be understood as "No change has been made since
restart_lsn". Does something like the following work?

ERROR: replication slot "repl" is not usable to get changes

By the way there are some other messages that doesn't render the
symptom but the cause.

"cannot use physical replication slot for logical decoding"
"replication slot \"%s\" was not created in this database"

Don't they need the same amendment?

> > > On the other hand, physical replication doesn't break by invlidation.
> > > [...]
> >
> > If we don't mind that standby can reconnect after a walsender
> > termination due to the invalidation, we don't need to do something for
> > this. Restricting max_slot_wal_keep_size to be larger than a certain
> > threshold would reduce the chance we see that behavior.
>
> Yeah, I think you're referring to the fact that StartReplication()
> doesn't verify the restart_lsn of the slot; and if we do add a check, a
> few tests that rely on physical replication start to fail. This patch
> only adds a comment in that spot. But I don't (yet) know what the
> consequences of this are, or whether it can be fixed by setting a valid
> restart_lsn ahead of time. This test in pg_basebackup fails, for
> example:
>
> # Running: pg_basebackup -D /home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl -X stream -S slot1
> pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: cannot read from replication slot "slot1"
> DETAIL: This slot has never previously reserved WAL, or has been invalidated
> pg_basebackup: error: child process exited with exit code 1
> pg_basebackup: removing data directory "/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl"
> not ok 95 - pg_basebackup -X stream with replication slot runs
>
> # Failed test 'pg_basebackup -X stream with replication slot runs'
> # at t/010_pg_basebackup.pl line 461.
>
>
> Anyway I think the current patch can be applied as is -- and if we want
> physical replication to have some other behavior, we can patch for that
> afterwards.

Agreed here. The false-invalidation doesn't lead to any serious
consequences.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2020-04-28 05:19:09 Re: Fix compilation failure against LLVM 11
Previous Message Michael Paquier 2020-04-28 04:56:23 Re: Fix compilation failure against LLVM 11