Re: Synchronous replay take III

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Synchronous replay take III
Date: 2019-01-15 10:16:54
Message-ID: CAD21AoBcvB-5B_w4n-dr90_BEuXcJoKb2KvWpGC_CrV1zcmM+w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Dec 1, 2018 at 10:49 AM Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>
> On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> > Unfortunately, cfbot says that the patch can't be applied without
> > conflicts. Could you please post a rebased version and address the
> > comments from Masahiko?
>
> Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
> top of those, I found myself wondering why syncrep.c thinks it needs
> special treatment for postmaster death. I don't see any reason why we
> shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
> this new version. If you kill -9 the postmaster, I don't see any
> reason to think that the existing coding is more correct than simply
> exiting immediately.
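>
> Roughly, the wait inside SyncRepWaitForLSN()'s loop then becomes
> (a sketch of the idea):
>
>     (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1,
>                      WAIT_EVENT_SYNC_REP);
>     ResetLatch(MyLatch);
>
> and the explicit WL_POSTMASTER_DEATH handling goes away.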
>
> On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
> > <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > > I was pinged off-list by a fellow -hackers denizen interested in the
> > > synchronous replay feature and wanting a rebased patch to test. Here
> > > it goes, just in time for a Commitfest. Please skip to the bottom of
> > > this message for testing notes.
> >
> > Thank you for working on this. The overview and your summary were
> > helpful for me to understand this feature. I've started to review
> > this patch for PostgreSQL 12. I've tested it and found some issues,
> > but let me ask you some questions about the high-level design
> > first. Sorry if these have already been discussed.
>
> Thanks for your interest in this work!
>
> > > This is a design choice favouring read-mostly workloads at the expense
> of write transactions. Hot standbys' whole raison d'être is to
> > > move *some* read-only workloads off the primary server. This proposal
> > > is for users who are prepared to trade increased primary commit
> > > latency for a guarantee about visibility on the standbys, so that
> > > *all* read-only work could be moved to hot standbys.
> >
> > To be clear, what did you mean by read-mostly workloads?
>
> I mean workloads where only a small percentage of transactions perform
> a write. If you need write-scalability, then hot_standby is not the
> solution for you (with or without this patch).
>
> The kind of user who would be interested in this feature is someone
> who already uses some kind of heuristics to move some queries to
> read-only standbys. For example, some people send transaction for
> logged-in users to the primary database (because only logged-in users
> generate write queries), and all the rest to standby servers (for
> example "public" users who can only read content). Another technique
> I have seen is to keep user sessions "pinned" on the primary server
> for N minutes after they perform a write transaction. These types of
> load balancing policies are primitive ways of achieving
> read-your-writes consistency, but they are conservative and
> pessimistic: they probably send too many queries to the primary node.
>
> This proposal is much more precise, allowing you to run the minimum
> number of transactions on the primary node (ie transactions that
> actually need to perform a write), and the maximum number of
> transactions on the hot standbys.
>
> As discussed, making reads wait for a token would be a useful
> alternative (and I am willing to help make that work too), but:
>
> 1. For users that do many more reads than writes, would you
> rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as
> the case may be, depending on your application.)
>
> 2. If you are also using synchronous_commit = on for increased
> durability, then you are already making writers wait, and you might be
> able to tolerate a small increase.
>
> Peter Eisentraut expressed an interesting point of view against this
> general line of thinking:
>
> https://www.postgresql.org/message-id/5643933F.4010701%40gmx.net
>
> My questions are: Why do we have hot_standby mode? Is load balancing
> a style of usage we want to support? Do we want a technology that
> lets you do more of it?
>
> > I think there are two kinds of reads on standbys: a read that
> > happens after writes, and a direct read (e.g. reporting). The former
> > usually requires causal reads, as you mentioned, in order to read
> > its own writes, but the latter might be different: it often wants to
> > read the latest data on the master at that time. IIUC, even if we
> > send a read-only query directly to a synchronous replay server, we
> > could get a stale result if the standby is delayed for less than
> > synchronous_replay_max_lag. So this synchronous replay feature would
> > be helpful for the former case (i.e. a few writes and many reads
> > that want to see them), whereas for the latter case, keeping the
> > reads waiting on the standby seems a reasonable solution.
>
> I agree 100% that this is not a solution for all users. But I also
> suspect a token system would be quite complicated, and can't be done
> in a way that is transparent to applications without giving up
> performance advantages. I wrote about my understanding of the
> trade-offs here:
>
> https://www.postgresql.org/message-id/CAEepm%3D0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ%40mail.gmail.com

Thank you for explaining. I now understand the use-cases of this
feature and of token-based causal reads.

>
> > Also, I think it's worth considering the cost of both causal reads
> > *and* non-causal reads.
> >
> > I've considered a mixed workload (transactions requiring causal
> > reads and transactions not requiring them) under the current design.
> > IIUC, in the current design we create something like a
> > consistent-reads group by specifying servers. For example, if a
> > transaction doesn't need causal reads it can send its query to any
> > server with synchronous_replay = off, but if it does, it should
> > select a synchronous replay server. It also means that client
> > applications or routing middleware such as pgpool must be aware of
> > the available synchronous replay standbys. That is, this design puts
> > the cost on the read-only transactions requiring causal reads. On
> > the other hand, with token-based causal reads we can send a
> > read-only query to any standby if we can wait for the change to be
> > replayed. Of course, if we don't want to wait forever we can time
> > out and switch to either another standby or the master to execute
> > the query, but we don't need to choose from a fixed set of standby
> > servers.
>
> Yeah. I think tools like pgpool that already know how to connect to
> the primary and look at pg_stat_replication could use the new column
> to learn which servers support synchronous replay, for routing
> purposes. I also think that existing read/write load balancing tools
> for Python (eg "django-balancer"), Ruby (eg "makara"), and Java could be
> adjusted to work with this quite easily.

Agreed.
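
For example, a routing tool could poll the primary with libpq along
these lines (a quick sketch I put together; the connection string is
just a placeholder, and I'm assuming the sync_replay column as it
appears in the v10 patch):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* connect to the primary; conninfo here is just a placeholder */
    PGconn     *conn = PQconnectdb("host=primary dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* ask the primary which standbys currently hold a replay lease */
    res = PQexec(conn,
                 "SELECT application_name, client_addr"
                 "  FROM pg_stat_replication"
                 " WHERE sync_replay = 'available'");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        for (int i = 0; i < PQntuples(res); i++)
            printf("standby %s (%s) can serve synchronous replay reads\n",
                   PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));
    }
    PQclear(res);
    PQfinish(conn);
    return 0;
}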

>
> In response to a general question from Simon Riggs at a conference
> about how anyone is supposed to use this thing in real life, I wrote a
> proof-of-concept Java Spring application that shows the techniques
> that I think are required to make good use of it:
>
> https://github.com/macdice/syncreplay-spring-demo
>
> 1. Use a transaction management library (this includes Python Django
> transaction management, Ruby ActiveRecord IIUC, Java Spring
> declarative transactions, ...), so that whole transactions can be
> retried automatically. This is generally a good idea anyway because
> it lets you retry automatically on serialisation failures and deadlock
> errors. The new error 40P02
> ERRCODE_T_R_SYNCHRONOUS_REPLAY_NOT_AVAILABLE is just another reason to
> retry, in SQL error code class "40" (or perhaps it should be "72"... I
> have joked that the new error could be called "snapshot too young"!)
>
> 2. Classify transactions (= blocks of code that run a transaction) as
> read-write or read-only. This can be done adaptively by remembering
> ERRCODE_READ_ONLY_SQL_TRANSACTION errors from previous attempts, or
> explicitly using something like Java's @Transactional(readOnly=true)
> annotations, so that the transaction management library can
> automatically route transactions through the right connection.
>
> 3. Automatically avoid standby servers that have recently failed with
> 40P02 errors.
>
> 4. Somehow know which server is the primary (my Java POC doesn't
> tackle that problem, but there are various techniques, such as trying
> all of them if you start seeing ERRCODE_READ_ONLY_SQL_TRANSACTION from
> the server that you expected to be a primary).
>
> The basic idea is that with a little bit of help from your
> language-specific transaction management infrastructure, your
> application can be 100% unaware, and benefit from load balancing. The
> point is that KeyValueController.java knows nothing about any of that
> stuff, and all the rest is Spring configuration that allows
> transactions to be routed to N database servers. It never shows you
> stale data.
>

Thank you! I'll try it.
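
In the meantime, to make sure I understand point 1: in plain libpq
terms, I suppose the retry loop would look something like the sketch
below (the kv table and query are hypothetical, and I'm treating
40P02 the same way as a serialization failure):

#include <string.h>
#include <libpq-fe.h>

#define MAX_ATTEMPTS 3

/*
 * Run a read-only transaction against a standby, retrying the whole
 * transaction when the server reports SQLSTATE 40P02 (synchronous
 * replay not available), just as one would retry 40001 or 40P01.
 */
static int
run_read_only(PGconn *conn)
{
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++)
    {
        PGresult   *res;
        const char *sqlstate;
        int         retryable;

        PQclear(PQexec(conn, "BEGIN"));
        res = PQexec(conn, "SELECT v FROM kv WHERE k = 42");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            PQclear(res);
            PQclear(PQexec(conn, "COMMIT"));
            return 0;           /* success */
        }
        sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
        retryable = (sqlstate != NULL && strcmp(sqlstate, "40P02") == 0);
        PQclear(res);
        PQclear(PQexec(conn, "ROLLBACK"));
        if (!retryable)
            return -1;          /* some other error: don't retry here */
    }
    return -1;                  /* lease kept failing: try another node */
}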

Regarding the design of the current patch (v10), I have some
questions and comments.

The patch introduces a new GUC parameter, synchronous_replay. We can
set synchronous_commit = off while setting synchronous_replay = on;
with this setting, the backend will synchronously wait for standbys
to replay. I'm concerned that having two separate GUC parameters
controlling the transaction commit behaviour would confuse users.
It's just an idea, but maybe we could use 'remote_apply' for the
synchronous replay purpose and introduce a new parameter for the
standby server, something like allow_stale_read.
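
(To illustrate, a purely hypothetical guc.c entry -- the name
allow_stale_read and its semantics are just my idea, not something in
the patch:)

/* hypothetical standby-side switch, not part of the v10 patch */
static bool allow_stale_read = false;

/* a candidate entry for the ConfigureNamesBool[] table in guc.c */
{
    {"allow_stale_read", PGC_USERSET, REPLICATION_STANDBY,
        gettext_noop("Allows queries on a standby to run without "
                     "a synchronous replay guarantee."),
        NULL
    },
    &allow_stale_read,
    false,
    NULL, NULL, NULL
},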

If, while a transaction is waiting for all standbys to replay, those
standbys become unavailable, should the waiter be released? The patch
seems not to release the waiter. Similarly, walsenders are not aware
of postgresql.conf changes while waiting for synchronous replay. I
think we should call SyncReplayPotentialStandby() in
SyncRepInitConfig(), as sketched below.
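
Something like this (a sketch, not tested; I'm assuming the patch's
SyncReplayPotentialStandby() can simply be re-run after a reload):

void
SyncRepInitConfig(void)
{
    int         priority;

    /* existing behaviour: refresh this walsender's sync rep priority */
    priority = SyncRepGetStandbyPriority();
    if (MyWalSnd->sync_standby_priority != priority)
    {
        LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
        MyWalSnd->sync_standby_priority = priority;
        LWLockRelease(SyncRepLock);
    }

    /* proposed addition: also refresh synchronous replay eligibility */
    SyncReplayPotentialStandby();
}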

With the settings synchronous_standby_names = '' and
synchronous_replay_standby_names = '*', we get the standby's status
in pg_stat_replication as sync_state = 'async' and sync_replay =
'available'. That looks odd to me. Yes, this status is correct in
principle, but considering the architecture of PostgreSQL
replication, this combination seems impossible.

The synchronous_replay_standby_names = '*' setting means that the
backend waits for all standbys connected to the master server to
replay, is that right? In my test, even when some of the synchronous
replay standby servers got stuck and therefore had their leases
revoked, the backend could proceed with its transactions.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
