Re: Stopping logical replication protocol

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Vladimir Gordiychuk <folyga(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Álvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Stopping logical replication protocol
Date: 2016-09-23 05:01:27
Message-ID: CAMsr+YFgEBo=UQ23HssNkk6CF=Vnk=Uj_gTt3Ffyi0J=2DBCpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 19 September 2016 at 07:12, Vladimir Gordiychuk <folyga(at)gmail(dot)com> wrote:
> New patch in attach. 0001 and 0002 without changes.
> 0003 - patch contain improvements for pg_recvloginca, now pg_recvloginca
> after receive SIGINT will send CopyDone package to postgresql and wait from
> server CopyDone. For backward compatible after repeat send SIGINT
> pg_recvloginca will continue immediately without wait CopyDone from server.
> 0004 patch contain regression tests on scenarios that fix 0001 and 0002
> patches.

Great.

Thanks for this.

First observation (which I'll fix in updated patch):

It looks like you didn't import my updated patches, so I've rebased
your new patches on top of them.

Whitespace issues:

$ git am ~/Downloads/0003-Add-ability-for-pg_recvlogical-safe-stop-replication.patch
Applying: Add ability for pg_recvlogical safe stop replication
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:140: indent
with spaces.
msgBuf + hdr_len + bytes_written,
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:550: indent
with spaces.
/* Backward compatible, allow force interrupt logical replication
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:551: indent
with spaces.
* after second SIGINT without wait CopyDone from server
/home/craig/projects/2Q/postgres/.git/rebase-apply/patch:552: indent
with spaces.
*/
warning: 4 lines add whitespace errors.

Remember to use "git log --check" before sending patches.

Also, comment style, please do

/* this */

or

/*
* this
*/

not

/* this
*/

I did't see you explain why this was removed:

- /* fast path */
- /* Try to flush pending output to the client */
- if (pq_flush_if_writable() != 0)
- WalSndShutdown();
-
- if (!pq_is_send_pending())
- return;

I fixed a warning introduced here:

pg_recvlogical.c: In function ‘ProcessXLogData’:
pg_recvlogical.c:289:2: warning: ISO C90 forbids mixed declarations
and code [-Wdeclaration-after-statement]
int bytes_left = msgLength - hdr_len;
^

Some of the changes to pg_recvlogical look to be copied from
receivelog.c, most notably the functions ProcessKeepalive and
ProcessXLogData . I thought that rather than copying them, shouldn't
the existing ones be modified to meet your needs? But it looks like
the issue is that a fair chunk of the rest of pg_recvlogical doesn't
re-use code from receivelog.c either; for example, pg_recvlogical's
sendFeedback differs from receivelog.c's sendFeedback. So
pg_recvlogical doesn't share the globals that receivelog.c assumes are
used. Also, what I thought were copied from receivelog.c were actually
extracted from code previously inline in StreamLogicalLog(...) in
pg_recvlogical.c .

I'm reluctant to try to untangle this in the same patch for risk that
it'll get stalled by issues with refactoring. The change you've
already made is a useful cleanup without messing with unnecessary
code.

Your patch 3 does something ... odd:

src/test/recovery/t/001_stream_rep.pl | 63
----------------------
src/test/recovery/t/002_archiving.pl | 53 -------------------
src/test/recovery/t/003_recovery_targets.pl | 146
---------------------------------------------------
src/test/recovery/t/004_timeline_switch.pl | 75
--------------------------
src/test/recovery/t/005_replay_delay.pl | 69
------------------------
src/test/recovery/t/006_logical_decoding.pl | 40 --------------
src/test/recovery/t/007_sync_rep.pl | 174
------------------------------------------------------------
src/test/recovery/t/previous/001_stream_rep.pl | 63
++++++++++++++++++++++
src/test/recovery/t/previous/002_archiving.pl | 53 +++++++++++++++++++
src/test/recovery/t/previous/003_recovery_targets.pl | 146
+++++++++++++++++++++++++++++++++++++++++++++++++++
src/test/recovery/t/previous/004_timeline_switch.pl | 75
++++++++++++++++++++++++++
src/test/recovery/t/previous/005_replay_delay.pl | 69
++++++++++++++++++++++++
src/test/recovery/t/previous/006_logical_decoding.pl | 40 ++++++++++++++
src/test/recovery/t/previous/007_sync_rep.pl | 174
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

so I've revised it to remove that whole change, which I think you
probably did not mean to commit.

Reworded commit message.

Ensured that we send feedback just before a client-initiated CopyDone
so server knows what position we saved up to. We don't discard already
buffered data

I've modified patch 3 so that it also flushes to disk and sends
feedback before sending CopyDone, then discards any xlog data received
after it sends CopyDone. This is helpful in ensuring that we get
predictable behaviour when cancelling pg_recvxlog and restarting it
because the client and server agree on the point at which replay
stopped.

I was evaluating the tests and I don't think they actually demonstrate
that the patch works. There's nothing done to check that
pg_recvlogical exited because of client-initiated CopyDone. While
looking into that I found that it also never actually hits
ProcessCopyDone(...) because libpq handles the CopyDone reply from the
server its self and treats it as end-of-stream. So the loop in
StreamLogicalLog will just end and ProcessCopyDone() is dead code.

Initialization of copyDoneSent and copyDoneReceived were also in the
wrong place and would've caused issues with retry looping.

I modified pg_recvlogical so that when in verbose mode it prints a
message when it exits by client request. Then modified the tests to
look for that message.

An updated patch series is attached. Please re-test and review my
changes. At that point I'll mark it ready to go unless someone else
wants to take a look. I'm pretty much out of time for this anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Respect-client-initiated-CopyDone-in-walsender.patch text/x-patch 3.5 KB
0002-Client-initiated-CopyDone-during-transaction-decodin.patch text/x-patch 11.3 KB
0003-Add-ability-for-pg_recvlogical-to-stop-replication-f.patch text/x-patch 14.8 KB
0004-Add-tests-for-stop-replication-protocol.patch text/x-patch 7.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tsunakawa, Takayuki 2016-09-23 05:04:28 Re: [RFC] Should we fix postmaster to avoid slow shutdown?
Previous Message Michael Paquier 2016-09-23 04:56:36 Re: pg_ctl promote wait