Re: Issues with Quorum Commit

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Issues with Quorum Commit
Date: 2010-10-06 17:57:57
Message-ID: 4CACB8A5.2040906@agliodbs.com
Lists: pgsql-hackers

All,

Let me clarify and consolidate this discussion. Again, my goal is that
this thread identify only the problems and desired behaviors for synch
rep with more than one sync standby. Several issues remain unresolved
even with a single sync standby, but I believe we should discuss those
on a separate thread, for clarity.

I also strongly believe that we should get single-standby functionality
committed and tested *first*, before working further on multi-standby.

So, to summarize earlier discussion on this thread:

There are 2 reasons to have more than one sync standby:

1) To increase durability above the level of a single synch standby,
even at the cost of availability.

2) To increase availability without decreasing durability below the
level offered by a single sync standby.

The "pure" setup for each of these options, where N is the number of
standbys and k is the number of acks required from standbys is:

1) k = N, N > 1, apply
2) k = 1, N > 1, recv
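
(For concreteness only: if we end up with a quorum-aware
synchronous_standby_names syntax along the lines of what has been
floated, plus per-commit sync levels, the pure setups might be spelled
roughly like this on the master. The ANY quorum syntax and the
remote_apply/remote_write levels below are assumptions for
illustration, not anything we have today:

    # Case 1: maximum durability -- wait for apply on all N standbys
    synchronous_commit = remote_apply
    synchronous_standby_names = 'ANY 3 (s1, s2, s3)'    # k = N = 3

    # Case 2: maximum availability -- wait for receipt by any one standby
    synchronous_commit = remote_write
    synchronous_standby_names = 'ANY 1 (s1, s2, s3)'    # k = 1, N = 3

where s1..s3 are the application_names the standbys connect with.)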

(Timeouts are a specific compromise of durability for availability on
*one* server, and as such will not be discussed here. BTW, I was the
one who suggested a timeout, rather than Simon, so if you don't like the
idea, harass me about it.)

Any configuration (3) other than the two above is a specific
compromise between durability and availability, for example:

3a) k = 2, N = 3, fsync
3b) k = 3, N = 10, recv

... should give you better durability than case 2) and better
availability than case 1).
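
(With the same assumed quorum syntax, 3a might look like:

    # Case 3a: ack from any 2 of 3 standbys, at fsync level
    synchronous_commit = on                             # remote fsync
    synchronous_standby_names = 'ANY 2 (s1, s2, s3)'    # k = 2, N = 3

and 3b would be 'ANY 3 (s1, ..., s10)' with a receive-level ack.)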

While it's nice to dismiss case (1) as an edge case, consider the
likelihood of someone running PostgreSQL with fsync=off on cloud
hosting. In that case, having k = N = 5 does not seem like an
unreasonable arrangement if you want to ensure durability via
replication. It's what the CAP databases do.

After eliminating some of my issues as non-issues, here's what we're
left with as problems for the cases above:

(1), (3) Accounting/Registration: implementing any of these cases
would seem to require some form of accounting and/or registration on
the master -- at a minimum, counting the acks for each data send. More
likely we will need, as proposed on other threads, a register of
standbys and the sync state of each. Not only will this
accounting/registration be hard code to write, it will have at least
*some* performance overhead. Whether that overhead is minor or
substantial can only be determined through testing. Further, there's
the issue of whether, and how, we transmit this register to the
standbys so that they can be promoted.
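
(To illustrate what "a register of standbys and the sync state of each"
might look like from the DBA's side, assuming we expose it through
something like a pg_stat_replication view -- the view and column names
here are a sketch of one possible interface, not a proposal:

    SELECT application_name,
           state,        -- e.g. catchup, streaming
           sync_state    -- e.g. sync, potential, async
    FROM pg_stat_replication
    ORDER BY application_name;

The hard part is the bookkeeping behind such a view, and getting that
register out to the standbys.)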

(2), (3) Degradation: (Jeff) these two cases make sense only if we give
DBAs the tools they need to monitor which standbys are falling behind,
and to drop and replace those standbys. Otherwise we risk giving DBAs
false confidence that they have better-than-1-standby reliability when
actually they don't. Current tools are not really adequate for this.
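
(A minimal sketch of the monitoring I mean, assuming the same
hypothetical standby-status view plus WAL-position functions for
computing lag -- the names are illustrative, not something we ship
today:

    -- how many bytes of WAL has each standby yet to replay?
    SELECT application_name,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
    FROM pg_stat_replication
    ORDER BY replay_lag_bytes DESC;

A DBA would alert on that number and drop or replace any standby that
stays persistently behind.)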

(1), (3) Dynamic Re-configuration: we need the ability to add and remove
standbys at runtime. We also need to have a verdict on how to handle
the case where a transaction is pending, per Heikki.
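
(As a sketch of the runtime knob this implies, assuming the standby
list ends up as a reloadable setting and we get something like ALTER
SYSTEM to change it -- neither is a given:

    -- add a third standby to the quorum without a restart
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)';
    SELECT pg_reload_conf();

Heikki's open question is what happens to a transaction that is already
waiting for acks at the moment the list changes.)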

(2), (3) Promotion: all multi-standby high-availability cases only make
sense if we provide tools to promote the most current standby to be the
new master. Otherwise the whole cluster still goes down whenever we
have to replace the master. We also should provide some mechanism for
promoting an async standby to sync; this has already been discussed.
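
(Roughly, the promotion tooling needs to answer "which standby is most
current?" and then promote that one. A sketch, assuming a
replay-position function on the standbys and a SQL-level promote call,
neither of which exists in this form yet:

    -- run on each standby; pick the one reporting the highest position
    SELECT pg_last_wal_replay_lsn();

    -- then, on the chosen standby
    SELECT pg_promote();

The rest is tooling around those two calls.)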

(1) Consistency: this is another DBA-false-confidence issue. DBAs who
implement (1) are liable to do so thinking that they are guaranteeing
not only the consistency of every standby with the master, but also the
consistency of every standby with every other standby -- a kind of
dummy multi-master. They are not, so it will take multiple reminders
and workarounds in the docs to explain this. And we'll get complaints
anyway.

(1), (2), (3) Initialization: (Dimitri) we need a process whereby a
standby can go from cloned to synced to being a sync rep standby, and
possibly from degraded to synced and back again.
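
(The lifecycle I read Dimitri as describing, sketched end-to-end; the
base-backup tool and the reloadable quorum setting used here are
assumptions, not things we have in this form yet:

    # on the new standby: clone from the master, streaming WAL
    # (with application_name = s3 in its connection settings)
    pg_basebackup -h master -D /var/lib/pgsql/s3 -X stream -R

    -- on the master, once the clone has caught up: admit it to the quorum
    ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)';
    SELECT pg_reload_conf();

Going from degraded back to synced is the same last step, taken again
once the standby has caught up.)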

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
