Re: Synchronous replay take III

From: Michail Nikolaev <michail(dot)nikolaev(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous replay take III
Date: 2019-01-30 14:09:48
Message-ID: CANtu0ojaMC8RoPUjtpv1RVHveCGM1iwpPVMiXLYRnC2BdhXKKA@mail.gmail.com
Lists: pgsql-hackers

Hello,

Sorry, I missed this email.

>> In our case we have implemented some kind of "replication barrier"
functionality based on table with counters (one counter per application
backend in simple case).
>> Each application backend have dedicated connection to each replica. And
it selects its counter value few times (2-100) per second from each replica
in background process (depending on how often replication barrier is used).

> Interesting approach. Why don't you sample pg_last_wal_replay_lsn()
> on all the standbys instead, so you don't have to generate extra write
> traffic?

Replay LSN was the first approach I tried: I sampled 'select replay_lsn
from pg_stat_replication' on the master to get information about the
replay position on the replicas.
However, for some unknown reason I was not able to get it to work: even
after replay_lsn had been reached, the standby still could not see the data.
I know that should not happen; I spent a few days debugging it. And since I
needed to ping the replicas anyway (to check whether one has become the
master, monitor ping, locks, connections, etc.) - I decided to introduce the
table for now.
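Just to illustrate the two sampling approaches (a rough sketch in Python with
psycopg2, purely for illustration - host names and DSNs are made up):

import psycopg2

# What I was doing: ask the master where each standby has replayed to.
master = psycopg2.connect("host=master dbname=app")      # made-up DSN
cur = master.cursor()
cur.execute("SELECT application_name, replay_lsn FROM pg_stat_replication")
print(cur.fetchall())

# What you suggest: ask each standby directly for its replay position.
replica = psycopg2.connect("host=replica1 dbname=app")   # made-up DSN
cur = replica.cursor()
cur.execute("SELECT pg_last_wal_replay_lsn()")
print(cur.fetchone()[0])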

>> Once application have committed transaction it may want join replication
barrier before return new data to a user. So, it increments counter in the
table and waits until all replicas have replayed that value according to
background monitoring process. Of course timeout, replicas health checks
and few optimizations and circuit breakers are used.

> I'm interested in how you handle failure (taking too long to respond
> or to see the new counter value, connectivity failure etc).
> Specifically, if the writer decides to give up on a certain standby
> (timeout, circuit breaker etc), how should a client that is connected
> directly to that standby now or soon afterwards know that this standby
> has been 'dropped' from the replication barrier and it's now at risk
> of seeing stale data?

Each standby has some health flags attached to it. Health is "red" when:
* we can't connect to the replica, or all connections are in use
* replica lag according to pg_last_xact_replay_timestamp is more than 3000ms
* replica lag according to pg_last_xact_replay_timestamp exceeded 3000ms at
some point within the last 10000ms
* the replica has been promoted to master
* etc.

For the replication barrier, we wait only for "green" replicas, and for at
most 5000ms. If we are still unable to see the new counter value on some
replicas, it is up to the client to decide how to handle it. In our case it
means the replica is lagging by more than 3000ms, so it is now "red", and the
next client request will be dispatched to another "green" replica. This is
handled by a special connection pool with a built-in balancer.
I'm not sure it is all 100% correct, but in our case we could simply proceed.
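Roughly, the waiting part looks like this (a simplified Python sketch; field
names are illustrative, and the real code also handles reconnects, circuit
breakers, etc.):

import time

BARRIER_TIMEOUT_MS = 5000  # the 5000ms limit mentioned above

def wait_for_barrier(counter_value, replicas):
    # 'replicas' is the state maintained by the background monitoring
    # process: a health flag plus the last counter value seen on each replica.
    deadline = time.monotonic() + BARRIER_TIMEOUT_MS / 1000.0
    while time.monotonic() < deadline:
        green = [r for r in replicas if r["health"] == "green"]
        if all(r["last_seen_counter"] >= counter_value for r in green):
            return True      # every healthy replica has replayed our counter
        time.sleep(0.01)     # counters are refreshed by the background poller
    return False             # timed out - the caller decides what to do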

> someone can successfully execute queries on a standby
> that hasn't applied a transaction that you know to be committed on the
> primary.

>> Nice thing here - constant number of connection involved. Even if lot of
threads joining replication barrier in the moment. Even if some replicas
are lagging.
>>
>> Because 2-5 seconds lag of some replica will lead to out of connections
issue in few milliseconds in case of implementation described in this
thread.

> Right, if a standby is lagging more than the allowed amount, in my
> patch the lease is cancelled and it will refuse to handle requests if
> the GUC is on, with a special new error code, and then it's up to the
> client to decide what to do. Probably find another node.

> Could you please elaborate? What could you do that would be better?
> If the answer is that you just want to know that you might be seeing
> stale data but for some reason you don't want to have to find a new
> node, the reader is welcome to turn synchronous_standby off and try
> again (giving up data freshness guarantees). Not sure when that would
> be useful though.

The main problem I see here is master connection usage. We have about 10,000
RPS on the master, so a small lag on a single replica (we have six of them)
would cause all master connections to wait for replay on the stale replica
until the timeout. That is an outage for the whole system, even if the
replica lagged by only 200-300ms (in the real world it can lag for seconds
on a regular basis).

If we set synchronous_replay_max_lag to 10-20ms, standbys would be cancelled
all the time.
In our case, a constant number of connections is involved. In addition,
client requests wait for standby replay inside the application backend thread
without blocking a master connection. I think this is the main difference.
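To illustrate the difference (table, column and pool names are made up): the
master connection is released right after commit, and the wait happens purely
in the application thread against the background monitor's data:

from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(1, 20, "host=master dbname=app")  # made-up DSN

def commit_and_join_barrier(backend_id, replicas):
    conn = pool.getconn()
    try:
        cur = conn.cursor()
        # Bump this backend's counter as part of the writing transaction.
        cur.execute("UPDATE barrier SET counter = counter + 1 "
                    "WHERE backend_id = %s RETURNING counter", (backend_id,))
        new_value = cur.fetchone()[0]
        conn.commit()
    finally:
        pool.putconn(conn)   # the master connection is free again immediately
    # Waiting does not hold any master connection: it only watches the
    # counters collected by the background monitoring process.
    return wait_for_barrier(new_value, replicas)   # sketch from above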

> I've attached a small shell script that starts up a primary and N
> replicas with synchronous_replay configured, in the hope of
> encouraging you to try it out.

Thanks - I will try it out and report back.
