Re: cheaper snapshots

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: cheaper snapshots
Date: 2011-07-28 23:20:04
Message-ID: CA+TgmoaXtAxKyj312w4vzisLav3HM87XyLxheLOuRzG5X=BN3A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 28, 2011 at 4:54 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> Having transactions become visible in the same order on the master
>> and the standby is very appealing, but I'm pretty well convinced
>> that allowing commits to become visible before they've been
>> durably committed is throwing the "D" in ACID out the window.  If
>> synchronous_commit is off, sure, but otherwise...
>
> It has been durably committed on the master, but not on the
> supposedly synchronous copy; so it's not so much throwing out the "D"
> in "ACID" as throwing out the "synchronous" in "synchronous
> replication".  :-(

Well, depends. Currently, the sequence of events is:

1. Insert commit record.
2. Flush commit record, if synchronous_commit in {local, on}.
3. Wait for synchronous replication, if synchronous_commit = on and
synchronous_standby_names is non-empty.
4. Make transaction visible.

If you move (4) before (3), you're throwing out the synchronous in
synchronous replication. If you move (4) before (2), you're throwing
out the D in ACID.
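
To make the ordering concrete, here is a minimal Python sketch of the
four steps above. It is an illustration only, not PostgreSQL code: the
function name commit_steps and the step strings are made up, and the
two parameters simply mirror the settings of the same names.

```python
def commit_steps(synchronous_commit, synchronous_standby_names):
    """Return the ordered steps a commit goes through (illustrative only)."""
    steps = ["insert commit record"]                       # step 1
    if synchronous_commit in ("local", "on"):
        steps.append("flush commit record")                # step 2
    if synchronous_commit == "on" and synchronous_standby_names:
        steps.append("wait for synchronous replication")   # step 3
    steps.append("make transaction visible")               # step 4
    return steps
```

In this sketch "make transaction visible" is always last. Moving it
ahead of the replication wait, or ahead of the flush, corresponds to
the two reorderings discussed above.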

> Unless I'm missing something we have a choice to make -- I see four
> possibilities (already mentioned on this thread, I think):
>
> (1)  Transactions are visible on the master which won't necessarily
> be there if a meteor takes out the master and you need to resume
> operations on the replica.
>
> (2)  An asynchronous commit must block behind any pending
> synchronous commits if synchronous replication is in use.

Well, again, there are three levels:

(A) synchronous_commit=off. No waiting!
(B) synchronous_commit=local transactions, and synchronous_commit=on
transactions when sync rep is not in use. Wait for xlog flush.
(C) synchronous_commit=on transactions when sync rep IS in use. Wait
for xlog flush and replication.

Under your option #2, if a type-A transaction commits after a type-B
transaction, it will need to wait for the type-B transaction's xlog
flush. If a type-A transaction commits after a type-C transaction, it
will need to wait for the type-C transaction to flush xlog and
replicate. And if a type-B transaction commits after a type-C
transaction, there's no additional waiting for xlog flush, because the
type-B transaction would have to wait for that anyway. But it will
also have to wait for the preceding type-C transaction to replicate.
So basically, you can't be more asynchronous than the guy in front of
you.
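
A rough way to model that rule, as a Python sketch: rank the three
levels and take a running maximum over the commit order. The names
Wait, classify and effective_waits are hypothetical, and the sketch
assumes every transaction in the list is still in flight when the
later ones commit (a predecessor that has already finished waiting
would not hold anyone up).

```python
from enum import IntEnum
from itertools import accumulate


class Wait(IntEnum):
    NONE = 0        # type A: no waiting
    FLUSH = 1       # type B: wait for xlog flush
    REPLICATE = 2   # type C: wait for xlog flush and replication


def classify(synchronous_commit, sync_rep_in_use):
    """Map the settings to level A, B or C as described above."""
    if synchronous_commit == "off":
        return Wait.NONE
    if synchronous_commit == "local":
        return Wait.FLUSH
    if synchronous_commit == "on":
        return Wait.REPLICATE if sync_rep_in_use else Wait.FLUSH
    raise ValueError("unexpected synchronous_commit value")


def effective_waits(commit_order):
    """Under option #2, each commit inherits the strongest wait seen
    so far in commit order (assuming all of them are still pending)."""
    return list(accumulate(commit_order, max))
```

For a commit order of B, A, C, B, A this gives B, B, C, C, C: the first
type-A transaction is dragged up to a flush wait, and everything behind
the type-C transaction waits for replication.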

Aside from the fact that this behavior isn't too hot from a user
perspective, it might lead to some pretty complicated locking. Every
time a transaction finishes xlog flush or sync rep, it's got to go
release the transactions that piled up behind it - but not too many,
just up to the next one that still needs to wait on some higher LSN.
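
The release step might look something like the following Python
sketch. It ignores locking entirely, which is the hard part, and only
shows the "release up to the next one that still has to wait" logic.
The queue layout (xid, commit LSN, level) and the names release_ready,
flushed_lsn and replicated_lsn are assumptions for illustration, not
existing PostgreSQL structures.

```python
from collections import deque


def release_ready(queue, flushed_lsn, replicated_lsn):
    """Pop transactions from the head of the commit-ordered queue until
    one is found that still has to wait; return the released xids.

    Each queue entry is (xid, commit_lsn, level), where level is
    "none", "flush" or "replicate".
    """
    released = []
    while queue:
        xid, lsn, level = queue[0]
        if level in ("flush", "replicate") and lsn > flushed_lsn:
            break   # head still waits for local xlog flush
        if level == "replicate" and lsn > replicated_lsn:
            break   # head still waits for the standby
        queue.popleft()
        released.append(xid)
    return released
```

With a queue of (1, 100, "replicate"), (2, 110, "none"),
(3, 120, "flush"), (4, 130, "none"), nothing is released until LSN 100
has been replicated, even though transaction 2 asked for no waiting at
all. Once it has, 1 and 2 go, and 3 and 4 stay queued until the flush
pointer passes 120.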

> (3)  Transactions become visible on the replica in a different order
> than they became visible on the master.
>
> (4)  We communicate acceptable snapshots to the replica to make the
> order of visibility match the master even when that
> doesn't match the order that transactions returned from commit.
>
> I don't see how we can accept (1) and call it synchronous
> replication.  I'm pretty dubious about (3), because we don't even
> have Snapshot Isolation on the replica, really.  Is (3) where we're
> currently at?  An advantage of (4) is that on the replica we would
> get the same SI behavior at Repeatable Read that exists on the
> master, and we could even use the same mechanism for SSI to provide
> Serializable isolation on the replica.
>
> I (predictably) like (4) -- even though it's a lot of work....

I think that (4), beyond being a lot of work, will also have pretty
terrible performance. You're basically talking about emitting two WAL
records for every commit instead of one. That's not going to be
awesome. It might be OK for small or relatively lightly loaded
systems, or those with "big" transactions. But for something like
pgbench or DBT-2, I think it's going to be a big problem. WAL is
already a major bottleneck for us; we need to find a way to make it
less of one, not more.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
