Re: [HACKERS] Restricting maximum keep segments by repslots

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: jgdr(at)dalibo(dot)com, andres(at)anarazel(dot)de, michael(at)paquier(dot)xyz, sawada(dot)mshk(at)gmail(dot)com, peter(dot)eisentraut(at)2ndquadrant(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, thomas(dot)munro(at)enterprisedb(dot)com, sk(at)zsrv(dot)org, michael(dot)paquier(at)gmail(dot)com
Subject: Re: [HACKERS] Restricting maximum keep segments by repslots
Date: 2020-04-27 22:33:42
Message-ID: 20200427223342.GA23152@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2020-Apr-08, Kyotaro Horiguchi wrote:

> At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
>
> Just avoiding starting replication when restart_lsn is invalid is
> sufficient (the attached, which is equivalent to a part of what the
> invalidated flag did). I thing that the error message needs a Hint but
> it looks on the subscriber side as:
>
> [22086] 2020-04-08 10:35:04.188 JST ERROR: could not receive data from WAL stream: ERROR: replication slot "s1" is invalidated
> HINT: The slot exceeds the limit by max_slot_wal_keep_size.
>
> I don't think it is not clean.. Perhaps the subscriber should remove
> the trailing line of the message from the publisher?

Thanks for the fix! I propose two changes:

1. reword the error like this:

ERROR: replication slot "regression_slot3" cannot be advanced
DETAIL: This slot has never previously reserved WAL, or has been invalidated

2. use the same error in one other place, to wit
pg_logical_slot_get_changes() and pg_replication_slot_advance(). I
made the DETAIL part the same in all places, but the ERROR line is
adjusted to what each callsite is doing.
I do think that this change in test_decoding is a bit unpleasant:

-ERROR: cannot use physical replication slot for logical decoding
+ERROR: cannot get changes from replication slot "repl"

The test is
-- check that we're detecting a streaming rep slot used for logical decoding
SELECT 'init' FROM pg_create_physical_replication_slot('repl');
SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');

> > On the other hand, physical replication doesn't break by invlidation.
> > [...]
>
> If we don't mind that standby can reconnect after a walsender
> termination due to the invalidation, we don't need to do something for
> this. Restricting max_slot_wal_keep_size to be larger than a certain
> threshold would reduce the chance we see that behavior.

Yeah, I think you're referring to the fact that StartReplication()
doesn't verify the restart_lsn of the slot; and if we do add a check, a
few tests that rely on physical replication start to fail. This patch
only adds a comment in that spot. But I don't (yet) know what the
consequences of this are, or whether it can be fixed by setting a valid
restart_lsn ahead of time. This test in pg_basebackup fails, for
example:

# Running: pg_basebackup -D /home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl -X stream -S slot1
pg_basebackup: error: could not send replication command "START_REPLICATION": ERROR: cannot read from replication slot "slot1"
DETAIL: This slot has never previously reserved WAL, or has been invalidated
pg_basebackup: error: child process exited with exit code 1
pg_basebackup: removing data directory "/home/alvherre/Code/pgsql-build/master/src/bin/pg_basebackup/tmp_check/tmp_test_EwIj/backupxs_sl"
not ok 95 - pg_basebackup -X stream with replication slot runs

# Failed test 'pg_basebackup -X stream with replication slot runs'
# at t/010_pg_basebackup.pl line 461.

Anyway I think the current patch can be applied as is -- and if we want
physical replication to have some other behavior, we can patch for that
afterwards.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
0001-check-slot-restart_lsn-in-a-couple-of-places.patch text/x-diff 5.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-04-27 23:01:38 Re: [BUG] non archived WAL removed during production crash recovery
Previous Message Darafei Komяpa Praliaskouski 2020-04-27 18:48:47 Re: Parallel GiST build on Cube