Re: [HACKERS] Replication status in logical replication

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Vaishnavi Prabakaran <vaishnaviprabakaran(at)gmail(dot)com>, Daniel Gustafsson <daniel(at)yesql(dot)se>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Replication status in logical replication
Date: 2018-07-06 05:21:48
Message-ID: 20180706052148.GA776@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 05, 2018 at 05:13:27PM +0900, Michael Paquier wrote:
> This concerns as well v10, so that's not actually an open item...
> Well, it was an open item last year. The last set of patches is from
> Simon here:
> https://www.postgresql.org/message-id/CANP8%2BjLwgsexwdPkBtkN5kdHN5TwV-d%3Di311Tq_FdOmzJ8QyRQ%40mail.gmail.com

logical_repl_caught_up_v4alt2.patch is actually incorrect after I tested
the thing, and that logical_repl_caught_up_v4.patch gets the correct
call.

> Simon, do you feel confident with your patch? If yes, could you finish
> wrapping it? I am getting myself familiar with the problem as this has
> been around for some time now so I am reviewing the thing as well and
> then I can board the ship..

Okay, I have spent some time today looking at this patch, and the error
is very easy to reproduce once you do that in the TAP tests:
1) Stop and start once the publisher in one of the tests of
src/test/subscription.
2) Enforce wait_for_catchup() to check for state = 'streaming'.
And then you would see the tests waiting until timeout is reached and
then die.

I would be inclined to add those tests in the final patch, the
disadvantage being that 1) makes one of the test scripts a bit longer,
but it can reproduce the failures most of the time. Having 2) is
actually nice for physical replication as the tests in
src/test/recovery/ use wait_for_catchup() in various ways.

Some other notes about the patch:
- I switched the error message in WalSndLoop as mentioned upthread for
nodes catching up, aka no more "primary" but "upstream server".
- Added a note about using only GetFlushRecPtr in XLogSendLogical as
logical decoding cannot be used on standby nodes. If one day logical
decoding gets supports on standby then this would need an update as
well.

Does this look fine to all the folks involved in this thread? It is
Friday afternoon here so my brain is getting fried, but I can finish
wrapping up this patch at the beginning of next week if there are no
objections. At quick glance this indeed would need a backpatch down to
9.4 but I have not spent time testing those configurations yet.
--
Michael

Attachment Content-Type Size
logical_repl_caught_up_v5.patch text/x-diff 3.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2018-07-06 05:26:03 Re: using expression syntax for partition bounds
Previous Message Tom Lane 2018-07-06 04:09:37 Re: Legacy GiST invalid tuples