Re: Sync Rep for 2011CF1

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Sync Rep for 2011CF1
Date: 2011-02-08 18:53:41
Message-ID: AANLkTikUxTM+6EaZ8S_OteDzCXaccsPVsd1U_ZUzM5i0@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 7, 2011 at 1:20 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Jan 15, 2011 at 4:40 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Here's the latest patch for sync rep.
>
> Here is a rebased version of this patch which applies to head of the
> master branch.  I haven't tested it yet beyond making sure that it
> compiles and passes the regression tests -- but this fixes the bitrot.

As I said yesterday I would, I spent some time working on this. I
think there are somewhere between three and six independently useful
features in this patch, plus a few random changes to the documentation
that I'm not sure whether we want or not (e.g. replacing master with
primary in a few places, or the other way around).

One problem with the core synchronous replication technology is that
walreceiver cannot both receive WAL and write WAL at the same time.
It switches back and forth between reading WAL from the network socket
and flushing it to disk. The impact of that is somewhat mitigated in
the current patch because it only implements the "fsync" level of
replication, and chances are that the network read time is small
compared to the fsync time. But it would certainly suck for the
"receive" level we've talked about having in the past, because after
receiving each batch of WAL, the WAL receiver wouldn't be able to send
any more acknowledgments until the fsync completed, and that's bound
to be slow. I'm not really sure how bad it will be in "fsync" mode;
it may be tolerable, but as Simon noted in a comment, in the long run
it'd certainly be nicer to have the WAL writer process running during
recovery.
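
To illustrate the point, here is a schematic of the control flow just
described -- this is not the actual walreceiver code, and the function
names are invented for illustration:

    #include <stddef.h>

    /* Invented stand-ins, for illustration only. */
    extern size_t recv_wal_from_socket(char *buf, size_t len); /* blocks on network */
    extern void   write_wal_to_disk(const char *buf, size_t len);
    extern int    time_to_flush(void);
    extern void   fsync_wal_file(void);

    static void
    walreceiver_loop_schematic(void)
    {
        char    buf[8192];

        for (;;)
        {
            size_t  n = recv_wal_from_socket(buf, sizeof(buf));

            if (n > 0)
                write_wal_to_disk(buf, n);
            if (time_to_flush())
                fsync_wal_file();   /* while this runs, the process can
                                     * neither read more WAL nor send
                                     * acknowledgments back upstream */
        }
    }

Everything happens serially in one loop, so the fsync stalls both the
network reads and any replies.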

As a general comment on the quality of the code, I think that the
overall logic is probably sound, but there are an awful lot of
debugging leftovers and inconsistencies between various parts of the
patch. For example, when I initially tested it, *asynchronous*
replication kept breaking between the master and the standby, and I
couldn't figure out why. I finally realized that there was a
ten-second pause that had been inserted into the WAL receiver loop as a
debugging tool, which was allowing the standby to get far enough behind
that the master was able to recycle WAL segments the standby still
needed. Under ordinary circumstances, I would say that a patch like
this was not mature enough to submit for review, let alone commit.
For that reason, I am pretty doubtful about the chances of getting
this finished for 9.1 without some substantial prolongation of the
schedule.

That having been said, there is at least one part of this patch which
looks to be in pretty good shape and seems independently useful
regardless of what happens to the rest of it, and that is the code
that sends replies from the standby back to the primary. This allows
pg_stat_replication to display the write/flush/apply log positions on
the standby next to the sent position on the primary, which as far as
I am concerned is pure gold. Simon had this set up to happen only
when synchronous replication or XID feedback was in use, but I think
people are going to want it even with plain old asynchronous
replication, because it provides a FAR easier way to monitor standby
lag than anything we have today. I've extracted this portion of the
patch, cleaned it up a bit, written docs, and attached it here.
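
For instance, with the attached patch applied, you can watch standby
lag from the master with something along these lines (column names as
I have them in the patch, based on the write/flush/apply positions
described above; subject to change before commit):

    SELECT application_name, state,
           sent_location, write_location,
           flush_location, replay_location
    FROM pg_stat_replication;

Comparing sent_location against the positions the standby reports back
gives you the lag directly, without needing to query both machines.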

I wasn't too sure how to control the timing of the replies. It's
worth noting that you have to send them pretty frequently for the
distinction between xlog written and xlog flushed to have any value.
What I've done here is made it so that every time we read all
available data on the socket, we send a reply. After flushing, we
send another reply. And then just for the heck of it we send a reply
at least every 10 seconds (configurable), which causes the
last-known-apply position to eventually get updated on the master.
This means the apply position can lag reality by a bit. Simon's
version adds a latch, so that the startup process can poke the WAL
receiver to send a reply when the apply position moves. But this is
substantially more complex and I'm not sure it's worth it. If we were
implementing the "apply" level of synchronous replication, we'd
clearly need that for performance not to stink. But since the patch
is only implementing "fsync" anyway, it doesn't seem necessary for
now.
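
In pseudo-C, the reply-timing rule amounts to something like this (the
names are invented for illustration; the actual code in the patch is
structured differently):

    typedef long long TimestampTz;        /* microseconds, as in PostgreSQL */

    extern TimestampTz  last_reply_time;
    extern TimestampTz  reply_interval;   /* 10 seconds by default, configurable */
    extern void         send_reply(void); /* reports write/flush/apply positions */

    static void
    maybe_send_reply(int drained_socket, int just_flushed, TimestampTz now)
    {
        /*
         * Send a reply (1) whenever we've read all available data on the
         * socket, (2) after each flush, and (3) at least every
         * reply_interval, so the last-known apply position eventually
         * gets updated on the master even when nothing else is going on.
         */
        if (drained_socket || just_flushed ||
            now - last_reply_time >= reply_interval)
        {
            send_reply();
            last_reply_time = now;
        }
    }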

The only real complaint I can imagine about offering this
functionality all the time is that it uses extra bandwidth. I'm
inclined to think that the ability to shut it off completely is
sufficient answer to that complaint.
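
That is, I'd expect a single standby-side GUC along these lines (the
name shown here is hypothetical), where setting it to zero shuts the
replies off entirely:

    # postgresql.conf on the standby; GUC name is hypothetical
    wal_receiver_status_interval = 10s   # send a reply at least this often
    #wal_receiver_status_interval = 0    # disable replies entirely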

<dons asbestos underwear>

Comments?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
wal-receiver-replies.patch application/octet-stream 25.4 KB
