Re: Issues with two-server Synch Rep

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Issues with two-server Synch Rep
Date: 2010-10-11 19:22:54
Message-ID: AANLkTimLNHvXV7HtpaNy_RVdJ+mRhu8tjiwzrhBRvG6i@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 11, 2010 at 2:07 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> I'll take a crack at answering these.  I don't think that the
>> procedure for setting up a standby server is going to change much.
>> The idea is presumably that you set up an async standby more or less
>> as you do now and then make whatever configuration changes are
>> necessary to flip it to synchronous.
>
> What is the specific "flip" procedure, though?  For one thing, I want to
> make sure that it's not necessary to restart the master or the standby
> to "flip" it, since that would be a catch-22.

Obviously. I presume it'll be something like "update postgresql.conf
or recovery.conf and run pg_ctl reload". I haven't (yet, anyway)
verified the actual behavior of the patches, but if the above isn't
feasible then we have a problem.

>> This is a completely separate issue from making replication
>> synchronous.  And, really?  Useless for running read queries?
>
> Absolutely.  For a synch standby, you can't tolerate any standby delay
> at all.  This means that anywhere from 1/4 to 3/4 of queries on the
> standby would be cancelled on any high-traffic OLTP server.  Hence,
> "useless".

What is your source for those numbers? They could be right, but I
simply don't know.

At any rate, I don't disagree that we have a problem. In fact, I
think we have a whole series of problems. The whole architecture of
replication as it exists in PG is pretty fundamentally limited right
now. Right now, a pruning operation on the master (regardless of
whether it's a HOT prune or vacuum) can happen when there are still
snapshots on the slave that need that data. Our only options are to
either wait for those snapshots to go away, or kill off the
queries/transactions that took them. Adding an XID feedback from the
slave to the master "fixes" the problem by preventing the master from
pruning those tuples until the slave no longer needs them, but at the
expense of bloating the master and all other standbys. That may,
indeed, be better for some use cases, but it's not really all that
good. It would be far better if we could decouple master cleanup from
standby cleanup, so that only the machine that actually has the old
query gets bloated. However, no one seems excited about writing that
code.
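
To make the bloat trade-off concrete, here is a toy sketch of how XID
feedback moves the master's pruning horizon. It is not PostgreSQL code:
XIDs are plain integers, a snapshot is reduced to the oldest XID it
still needs, and the function name prune_horizon is made up for
illustration.

```python
# Toy model, not PostgreSQL source. A snapshot is represented only by
# the oldest XID it still needs (its xmin); XID wraparound is ignored.

def prune_horizon(master_xmins, standby_feedback_xmins, next_xid):
    """Return the oldest XID whose dead tuples must still be kept.

    Without feedback, only the master's own snapshots hold back pruning.
    With feedback, the oldest snapshot on any standby holds it back too.
    """
    candidates = list(master_xmins) + list(standby_feedback_xmins)
    # With no snapshots anywhere, everything before next_xid is prunable.
    return min(candidates, default=next_xid)


# Master's oldest snapshot needs XID 950; one standby runs a long
# report whose snapshot still needs XID 400.
master_only = prune_horizon([950], [], next_xid=1000)
with_feedback = prune_horizon([950], [400], next_xid=1000)
```

With feedback the horizon drops from 950 to 400, so the master (and
every other standby replaying its WAL) keeps dead tuples that only the
one standby's query needs.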

A further grump about our current architecture is that it doesn't seem
at all clear how to make it work for partial replication. I have to
wonder whether we are going down the wrong path completely and need to
hit the reset button. But neither this nor the pruning problem is
something that we can reasonably expect the sync rep patch to solve, if
we want it to get committed this release cycle.

>>>  As such, any Synch Rep patch
>>> must work together with attempts to simplify administration.  How does
>>> your design do this?
>>
>> This is also completely out of scope for sync rep.
>
> It is not, given that I've seen several proposals for synch rep which
> would make asynch rep even more complicated than it already is.

I'm not aware of any proposals on the table which would do that.

> I'm
> taking the stance that any sync rep design which *blocks* making asynch
> rep easier to use is fundamentally flawed and can't be accepted.

Do you have some ideas on how to simplify it? How will we know
whether a particular design for sync rep does this?

>> I don't think there's much hope of allowing administrators to take
>> action BEFORE the database becomes unavailable.
>
> I'd swear that you were working as a DBA less than a year ago, but I
> couldn't tell it from that statement.

Your comment sounded to me like you were asking for a schedule of all
future unplanned outages.

> There is every bit of value in allowing DBAs to view, and chart,
> response times on the standby for ACK.  That way they can notice an
> increase in response times and take action to improve the standby
> *before* it locks up the system.

Sure, that would be nice to have, and it's a good idea. But I don't
think that's going to be a common failure mode. What I expect is for
the standby to hum along with no problem for a long time and
then either kick a disk or suffer a power outage. There's very little
monitoring we can do within PG that will notice either of those things
coming. There might be some external-to-PG monitoring that can be
done, but if there's a massive blackout or a terrorist attack or
somebody trips over the power cord, you're just going to get
surprised.

>>   Presumably, if
>> synchronous replication is disabled via (1) or (2) above, then any
>> outstanding committed-but-unacknowledged-to-the-client transactions
>> should notify the client of the commit and continue on.
>
> That's what I was asking about.  I'm not "presuming" that any pending
> patch covers any such eventuality until it's confirmed.

Yep, we need to confirm that.

>> If a client loses the connection after issuing a commit but before
>> receiving the acknowledgment, it can't know whether the commit
>> happened or not.  This is true regardless of whether there is a
>> standby and regardless of whether that standby is synchronous.
>> Clients that care need to implement their own mechanisms for resolving
>> this difficulty.
>
> That's a handwavy way of saying "go away, don't bother us with such
> details".  For the client to resolve the situation, then *it* needs to
> be able to tell whether or not the transaction was committed.  How would
> it do this, exactly?

No, it isn't at all. What does your application do NOW if the master
goes down after you've sent a commit and before you get an
acknowledgment back? Does it assume that the transaction is
committed, or does it assume the transaction was aborted by a crash on
the master? Either is possible, right?
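
One application-level mechanism of the kind mentioned above, sketched
under loud assumptions: the client writes a unique token in the same
transaction as its real work, and if the connection drops during COMMIT
it reconnects and checks whether the token is there. FakeServer,
ConnectionLost and commit_and_resolve are hypothetical names, and the
"server" is an in-memory stand-in, not a real database driver.

```python
import uuid


class ConnectionLost(Exception):
    """Raised when the (simulated) connection drops during COMMIT."""


class FakeServer:
    """Hypothetical in-memory stand-in for a database server."""

    def __init__(self, drop="never"):
        # drop: "never", "before" (crash before the commit is durable),
        # or "after" (commit is durable, but the ack never arrives).
        self.drop = drop
        self.committed_tokens = set()

    def commit(self, token):
        if self.drop == "before":
            raise ConnectionLost()
        self.committed_tokens.add(token)
        if self.drop == "after":
            raise ConnectionLost()

    def has_token(self, token):
        return token in self.committed_tokens


def commit_and_resolve(server):
    # The token stands for a row written in the same transaction as
    # the application's real work, so it commits or aborts with it.
    token = str(uuid.uuid4())
    try:
        server.commit(token)
        return "committed"
    except ConnectionLost:
        # Outcome unknown: "reconnect" and look for the token.
        return "committed" if server.has_token(token) else "aborted"
```

From the client's side the "before" and "after" cases look identical
(a lost connection); only the follow-up lookup tells them apart. None
of this depends on whether a standby exists or is synchronous.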

>> It's theoretically impossible for the transaction to become visible
>> everywhere simultaneously.  It's already the case that transactions
>> become visible to other backends before the backend doing the commit
>> has received an acknowledgment.  Any client relying on any other
>> behavior is already broken.
>
> So, your opinion is "it's out of scope to handle this issue" ?

What handling of it would you propose? Consider the case where you
just have one server and no standbys. A client connects, does some
work, and says COMMIT. There is some finite amount of time after the
COMMIT happens and before the client gets the acknowledgment back that
the commit has succeeded. During that time, another transaction that
starts up will see the effects of the COMMIT - BEFORE the transaction
itself knows that it is committed. There's not much you can do about
this. You have to do the commit on the server before sending the
response back to the client.

In the sync rep case, you're going to get the same behavior. After
the client has asked for commit and before the commit has been
acknowledged, there's no guarantee whether another transaction that
starts up during that in-between time sees the transaction or not.
The only further anomaly that can happen as a result of sync rep is
that, in apply mode, the transaction's effects will become visible on
the standby before they are visible on the master, so if you fire off
a COMMIT, and then before receiving the acknowledgment start a
transaction on the standby, and then just after that start a
transaction on the master, and then just after that you get back an
acknowledgment that the COMMIT completed, you might have a snapshot on
the master that was taken afterwards chronologically but shows the
effects of fewer committed XIDs - i.e. time has gone backwards.
Unfortunately, short of a global transaction manager, this is an
unsolvable problem, and that's definitely more than is going to happen
for 9.1, I think.
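
The ordering described above can be replayed as a toy simulation. This
is only an illustration of the sequence of events, assuming apply mode
makes the commit visible on the standby before the master; Node and the
XID value are invented for the example.

```python
class Node:
    """Toy server: tracks which committed XIDs are visible to new snapshots."""

    def __init__(self):
        self.visible = set()

    def snapshot(self):
        # A snapshot freezes the set of XIDs visible at this instant.
        return frozenset(self.visible)


master, standby = Node(), Node()
XID = 100

# 1. Client sends COMMIT; the master flushes WAL and ships it.
# 2. Apply mode: the standby replays the commit record first.
standby.visible.add(XID)

# 3. Before the ack arrives, a transaction starts on the standby...
snap_standby = standby.snapshot()
# 4. ...and just after that, another one starts on the master.
snap_master = master.snapshot()

# 5. The standby's ack reaches the master, which makes the commit
#    visible locally and acknowledges the client.
master.visible.add(XID)
```

The master snapshot was taken later in wall-clock time, yet it sees
fewer committed XIDs than the standby snapshot taken before it.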

>> Sync rep is going to be slow, period.  Every implementation currently
>> on the table has to fsync on the master, and then send the commit xlog
>> record to the slave and wait for an acknowledgment from the slave.
>> Allowing those to happen in parallel is going to be Hard.
>
> Yes, but it's something we need to address.

I agree, but it's not something we can address in the first patch,
which is hard enough without adding things that make it even harder.
We need to get something simple committed first and then build on it.

> XA is widely distrusted and
> is seen as inadequate for high-traffic OLTP systems precisely because it
> is SO slow.  If we want to create a synch rep system which people will
> want to use, then it has to be faster than XA.  If it's not faster than
> XA, why bother creating it?  We already have 2PC.

I don't know anything about XA so I can't comment on this.

>> Also, the
>> interaction with max_standby_delay is going to be a big problem, I
>> suspect.
>
> Interaction?  My opinion is that the two are completely incompatible.
> You can't have synch rep and also have standby_delay > 0.

We seem to be in violent agreement on this point. I was saying the
same thing in a different way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
