Re: Synchronous replay take III

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Synchronous replay take III
Date: 2018-12-01 01:48:29
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:
> Unfortunately, cfbot says that patch can't be applied without conflicts, could
> you please post a rebased version and address commentaries from Masahiko?

Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on
top of those, I found myself wondering why syncrep.c thinks it needs
special treatment for postmaster death. I don't see any reason why we
shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in
this new version. If you kill -9 the postmaster, I don't see any
reason to think that the existing coding is more correct than simply
exiting immediately.

On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> > I was pinged off-list by a fellow -hackers denizen interested in the
> > synchronous replay feature and wanting a rebased patch to test. Here
> > it goes, just in time for a Commitfest. Please skip to the bottom of
> > this message for testing notes.
> Thank you for working on this. The overview and your summary was
> helpful for me to understand this feature, thank you. I've started to
> review this patch for PostgreSQL 12. I've tested this patch and found
> some issue but let me ask you questions about the high-level design
> first. Sorry if these have been already discussed.

Thanks for your interest in this work!

> > This is a design choice favouring read-mostly workloads at the expense
> > of write transactions. Hot standbys' whole raison for existing is to
> > move *some* read-only workloads off the primary server. This proposal
> > is for users who are prepared to trade increased primary commit
> > latency for a guarantee about visibility on the standbys, so that
> > *all* read-only work could be moved to hot standbys.
> To be clear what did you mean read-mostly workloads?

I mean workloads where only a small percentage of transactions perform
a write. If you need write-scalability, then hot_standby is not the
solution for you (with or without this patch).

The kind of user who would be interested in this feature is someone
who already uses some kind of heuristics to move some queries to
read-only standbys. For example, some people send transaction for
logged-in users to the primary database (because only logged-in users
generate write queries), and all the rest to standby servers (for
example "public" users who can only read content). Another technique
I have seen is to keep user sessions "pinned" on the primary server
for N minutes after they perform a write transaction. These types of
load balancing policies are primitive ways of achieving
read-your-writes consistency, but they are conservative and
pessimistic: they probably send too many queries to the primary node.

This proposal is much more precise, allowing you to run the minimum
number of transactions on the primary node (ie transactions that
actually need to perform a write), and the maximum number of
transactions on the hot standbys.

As discussed, making reads wait for a token would be a useful
alternative (and I am willing to help make that work too), but:

1. For users that do more many more reads than writes, would you
rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as
the case may be, depending on your application.)

2. If you are also using synchronous_commit = on for increased
durability, then you are already making writers wait, and you might be
able to tolerate a small increase.

Peter Eisentraut expressed an interesting point of view against this
general line of thinking:

My questions are: Why do we have hot_standby mode? Is load balancing
a style of usage we want to support? Do we want a technology that
lets you do more of it?

> I think there are two kind of reads on standbys: a read happend after
> writes and a directly read (e.g. reporting). The former usually
> requires the causal reads as you mentioned in order to read its own
> writes but the latter might be different: it often wants to read the
> latest data on the master at the time. IIUC even if we send a
> read-only query directly to a synchronous replay server we could get a
> stale result if the standby delayed for less than
> synchronous_replay_max_lag. So this synchronous replay feature would
> be helpful for the former case(i.e. a few writes and many reads wants
> to see them) whereas for the latter case perhaps the keeping the reads
> waiting on standby seems a reasonable solution.

I agree 100% that this is not a solution for all users. But I also
suspect a token system would be quite complicated, and can't be done
in a way that is transparent to applications without giving up
performance advantages. I wrote about my understanding of the
trade-offs here:

> Also I think it's worth to consider the cost both causal reads *and*
> non-causal reads.
> I've considered a mixed workload (transactions requiring causal reads
> and transactions not requiring it) on the current design. IIUC the
> current design seems like that we create something like
> consistent-reads group by specifying servers. For example, if a
> transaction doesn't want to causality read it can send query any
> server with synchronous_replay = off but if it wants, it should select
> a synchronous replay server. It also means that client applications or
> routing middlewares such as pgpool is required to be aware of
> available synchronous replay standbys. That is, this design would cost
> the read-only transactions requiring causal reads. On the other hand,
> in token-based causal reads we can send read-only query any standbys
> if we can wait for the change to be replayed. Of course if we don't
> wait forever we can timeout and switch to either another standby or
> the master to execute query but we don't need to choose a server of
> standby servers.

Yeah. I think tools like pgpool that already know how to connect to
the primary and look at pg_stat_replication could use the new column
to learn which servers support synchronous replay, for routing
purposes. I also think that existing read/write load balancing tools
for Python (eg "django-balancer"), Ruby (eg "makara"), Java could be
adjusted to work with this quite easily.

In response to a general question from Simon Riggs at a conference
about how anyone is supposed to use this thing in real life, I wrote a
proof-of-concept Java Spring application that shows the techniques
that I think are required to make good use of it:

1. Use a transaction management library (this includes Python Django
transaction management, Ruby ActiveRecord IIUC, Java Spring
declarative transactions, ...), so that whole transactions can be
retried automatically. This is generally a good idea anyway because
it lets you retry automatically on serialisation failures and deadlock
errors. The new error 40P02
retry, in SQL error code class "40" (or perhaps is should be "72"... I
have joked that the new error could be called "snapshot too young"!)

2. Classify transactions (= blocks of code that run a transaction) as
read-write or read-only. This can be done adaptively by remembering
ERRCODE_READ_ONLY_SQL_TRANSACTION errors from previous attempts, or
explicitly using something like Java's @Transactional(readOnly=true)
annotations, so that the transaction management library can
automatically route transactions through the right connection.

3. Automatically avoid standby servers that have recently failed with
40P02 errors.

4. Somehow know which server is the primary (my Java POC doesn't
tackle that problem, but there are various techniques, such as trying
all of them if you start seeing ERRCODE_READ_ONLY_SQL_TRANSACTION from
the server that you expected to be a primary).

The basic idea is that with a little bit of help from your
language-specific transaction management infrastructure, your
application can be 100% unaware, and benefit from load balancing. The
point is that knows nothing about any of that
stuff, and all the rest is Spring configuration that allows
transactions to be routed to N database servers. It never shows you
stale data.

Thomas Munro

Attachment Content-Type Size
0001-Synchronous-replay-mode-for-avoiding-stale-reads--v9.patch application/octet-stream 82.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2018-12-01 03:57:05 Re: Synchronous replay take III
Previous Message David Fetter 2018-12-01 00:03:40 Re: [PATCH] Log CSV by default