Re: Synchronous replay take III

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous replay take III
Date: 2018-12-28 03:34:01
Message-ID: CAEepm=1NtapgoR=5xz67N1ck3ZtWcVPj2rV24vybAueyKZ8YpQ@mail.gmail.com
Lists: pgsql-hackers

Hello,

Here is a rebased patch, and separate replies to Michael and Michail.

On Sat, Dec 1, 2018 at 4:57 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Sat, Dec 01, 2018 at 02:48:29PM +1300, Thomas Munro wrote:
> > Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
> > top of those, I found myself wondering why syncrep.c thinks it needs
> > special treatment for postmaster death. I don't see any reason why we
> > shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
> > this new version. If you kill -9 the postmaster, I don't see any
> > reason to think that the existing coding is more correct than simply
> > exiting immediately.
>
> Hm. This stuff runs under many assumptions, so I think that we should
> be careful here with any changes as the very recent history has proved
> (4c70336). If we were to switch WAL senders on postmaster death, I
> think that this could be a change independent of what is proposed here.

Fair point. I think the effect should be the same with less code:
either way you see the server hang up without sending a COMMIT tag,
but maybe I'm missing something. Change reverted; let's discuss that
another time.

On Mon, Dec 3, 2018 at 9:01 AM Michail Nikolaev
<michail(dot)nikolaev(at)gmail(dot)com> wrote:
> It is a really nice feature. I am working on a project which reads heavily from replicas (6 of them).

Thanks for your feedback.

> In our case we have implemented some kind of "replication barrier" functionality based on a table with counters (one counter per application backend in the simple case).
> Each application backend has a dedicated connection to each replica, and it selects its counter value a few times (2-100) per second from each replica in a background process (depending on how often the replication barrier is used).

Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
on all the standbys instead, so you don't have to generate extra write
traffic?
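
For example, something along these lines (just a sketch; the LSN shown
is a placeholder for whatever the primary actually returned):

-- On the primary, right after the commit you care about:
SELECT pg_current_wal_lsn();              -- say it returns '0/3000060'

-- In the background process, on each standby, poll until it has
-- replayed at least that far:
SELECT pg_last_wal_replay_lsn() >= '0/3000060'::pg_lsn;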

> Once the application has committed a transaction it may want to join the replication barrier before returning new data to a user. So, it increments its counter in the table and waits until all replicas have replayed that value according to the background monitoring process. Of course timeouts, replica health checks and a few optimizations and circuit breakers are used.

I'm interested in how you handle failure (taking too long to respond
or to see the new counter value, connectivity failure etc).
Specifically, if the writer decides to give up on a certain standby
(timeout, circuit breaker etc), how should a client that is connected
directly to that standby now or soon afterwards know that this standby
has been 'dropped' from the replication barrier and it's now at risk
of seeing stale data? My patch handles this by cancelling standbys'
leases explicitly and waiting for a response (if possible), but
otherwise waiting for them to expire (say if connectivity is lost or
standby has gone crazy or stopped responding), so that there is no
scenario where someone can successfully execute queries on a standby
that hasn't applied a transaction that you know to be committed on the
primary.

> A nice thing here is the constant number of connections involved, even if a lot of threads join the replication barrier at the same moment, and even if some replicas are lagging.
>
> Because a 2-5 second lag on some replica would lead to an out-of-connections issue within a few milliseconds with the implementation described in this thread.

Right, if a standby is lagging more than the allowed amount, in my
patch the lease is cancelled and it will refuse to handle requests if
the GUC is on, with a special new error code, and then it's up to the
client to decide what to do. Probably find another node.

> It may be the weak part of the patch, I think, at least for our case.

Could you please elaborate? What could you do that would be better?
If the answer is that you just want to know that you might be seeing
stale data but for some reason you don't want to have to find a new
node, the reader is welcome to turn synchronous_replay off and try
again (giving up data freshness guarantees). Not sure when that would
be useful though.
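
For what it's worth, that fallback would look something like this from
the reader's side (a sketch; the table name is made up):

-- On a standby whose lease has been cancelled:
SET synchronous_replay = on;
SELECT * FROM my_table;        -- rejected with the new error code

-- Accept possibly-stale data instead:
SET synchronous_replay = off;
SELECT * FROM my_table;        -- succeeds, no freshness guarantee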

> But it possibly could be used to eliminate the odd table with counters in my case (if it is possible to change the setting per transaction).

Yes, the behaviour can be activated per transaction, using the usual
GUC scoping rules. The setting synchronous_replay must be on in both
the write transaction and the following read transaction for the logic
to work (ie for the writer to wait, and for the reader to make sure
that it has a valid lease or raise an error).
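
For example, with SET LOCAL (sketch only, using a made-up table):

-- Writer, on the primary:
BEGIN;
SET LOCAL synchronous_replay = on;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;  -- waits for standbys holding valid leases to replay this

-- Reader, on a standby:
BEGIN;
SET LOCAL synchronous_replay = on;
SELECT balance FROM accounts WHERE id = 1;  -- sees the update, or the
COMMIT;                                     -- standby raises an error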

It sounds like my synchronous_replay GUC is quite similar to your
replication barrier system, except that it has a way to handle node
failure and excessive lag without abandoning the guarantee.

I've attached a small shell script that starts up a primary and N
replicas with synchronous_replay configured, in the hope of
encouraging you to try it out.

--
Thomas Munro
http://www.enterprisedb.com

Attachment Content-Type Size
0001-Synchronous-replay-mode-for-avoiding-stale-reads-v10.patch application/octet-stream 82.3 KB
test-synchronous-replay.sh application/x-sh 1.9 KB
