Re: Synchronous Standalone Master Redoux

From: Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Synchronous Standalone Master Redoux
Date: 2012-07-13 00:27:01
Message-ID: CAETJ_S9Tr8aFhy9xDKExbawgMdnw8NaFkRKdBDsovU8i6nw+0w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 12, 2012 at 8:35 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Hi,
>
> Jose Ildefonso Camargo Tolosa <ildefonso(dot)camargo(at)gmail(dot)com> writes:
>> environments. And no, it doesn't makes synchronous replication
>> meaningless, because it will work synchronous if it have someone to
>> sync to, and work async (or standalone) if it doesn't: that's perfect
>> for HA environment.
>
> You seem to want Service Availibility when we are providing Data
> Availibility. I'm not saying you shouldn't ask what you're asking, just
> that it is a different need.

Yes and no: I don't see why we can't have an option to choose which
one we want. I can see the point of "data availability": it is better
to freeze the service than to risk losing transactions. However, try
explaining that to some managers: "well, you know, the DB server froze
the whole bank system because the standby server died, and we didn't
want to risk transaction loss, so we just froze the master... you
know, in case the master were to die too before we had a reliable
standby." I don't think a manager would really understand why you
would block the whole company's system just because *the standby*
server died (and why don't you block it when the master dies?!).
Now, maybe that's a bad example; I know a bank should have at least 3
or 4 servers, with some of them in different geographical areas, but
just think of the typical boss.

With "Service Availability", you have data availability most of the
time, until one of the servers fails (if you have just 2 nodes). If
you have more than two, well, good for you! But you can keep going
with a single server, understanding that you are running at high risk
that has to be fixed really soon (an emergency).
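
To make the two behaviours concrete, here is a rough model in Python.
It is only a sketch of the policy choice under discussion, not
PostgreSQL code: the names Policy, STRICT, DEGRADE and commit_outcome
are made up for illustration, and no such setting exists in current
releases.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical names, used only for this illustration.
    STRICT = "strict"    # current behaviour: a commit waits for a standby
    DEGRADE = "degrade"  # requested option: go standalone if no standby is attached


def commit_outcome(policy: Policy, standbys_connected: int) -> str:
    """Describe what a committing client would observe under each policy."""
    if standbys_connected > 0:
        # Both policies behave the same while a standby is available.
        return "acknowledged after standby confirms"
    if policy is Policy.STRICT:
        # Data availability first: the service freezes.
        return "blocked until a standby returns"
    # Service availability first: keep going, knowing the risk.
    return "acknowledged locally (at-risk window)"
```

With one standby connected both policies give the same answer; they
only differ once the last standby is gone, which is exactly the case
being argued about.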

>
> If you troll the archives, you will see that this debate has received
> much consideration already. The conclusion is that if you care about
> Service Availibility you should have 2 standby servers and set them both
> as candidates to being the synchronous one.

That costs more, and for most applications it isn't worth the extra cost.

Really, I see your point, and I have *never* asked to remove the data
guarantees, only to have an option to relax them if the particular
situation requires it: "enough safety" for a given cost.

>
> That way, when you lose one standby the service is unaffected, the
> second standby is now the synchronous one, and it's possible to
> re-attach the failed standby live, with or without archiving (with is
> preferred so that the master isn't involved in the catch-up phase).
>
>> As synchronous standby currently is, it just doesn't fit the HA usage,
>
> It does actually allow both data high availability and service high
> availability, provided that you feed at least two standbys.

Still, it doesn't fit. You need to spend more on hardware and more on
power (and money there), with a bigger carbon footprint... you get the
point. Also, having 3 servers for your DB can be necessary (and
possible) for some companies, but not for others.

>
> What you seem to be asking is both data and service high availability
> with only two nodes. You're right that we can not provide that with
> current releases of PostgreSQL. I'm not sure anyone has a solid plan to
> make that happen.
>
>> and if you really want to keep it that way, it doesn't belong to the
>> HA chapter on the pgsql documentation, and should be moved. And NO
>> async replication will *not* work for HA, because the master can have
>> more transactions than standby, and if the master crashes, the standby
>> will have no way to recover these transactions, with synchronous
>> replication we have *exactly* what we need: the data in the standby,
>> after all, it will apply it once we promote it.
>
> Exactly. We want data availability first. Service availability is
> important too, and for that you need another standby.

Yeah, you need that with PostgreSQL, but not with DRBD, for example
(sorry, but DRBD is one of the flagship HA tools in the Linux world).
Also, I'm not convinced about the "2nd standby" thing... I mean, just
read this in the docs, which is a little alarming:

"If primary restarts while commits are waiting for acknowledgement,
those waiting transactions will be marked fully committed once the
primary database recovers. There is no way to be certain that all
standbys have received all outstanding WAL data at time of the crash
of the primary. Some transactions may not show as committed on the
standby, even though they show as committed on the primary. The
guarantee we offer is that the application will not receive explicit
acknowledgement of the successful commit of a transaction until the
WAL data is known to be safely received by the standby."

So... there is no *real* guarantee here either... I don't know how I
skipped that paragraph before today. It implies that a transaction
could be marked as committed on the master while the app was never
informed of it (and thus could try to send it again), and the
transaction was NOT applied on the standby. How can this happen? When
the master comes back, shouldn't the standby get the missing WAL
pieces from the master and then apply the transaction? The standby
part is the one that I don't really get. On the application side,
well, there are several ways in which you can miss the "commit
confirmation" (connection issues at the worst moment, and the like),
so I guess it is not *so* serious, and the app should have a way of
checking its last transaction if it lost connectivity to the server
before getting the commit confirmation.
