Re: Sync Rep v17

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Daniel Farina <daniel(at)heroku(dot)com>
Subject: Re: Sync Rep v17
Date: 2011-02-20 04:26:40
Message-ID: AANLkTi=P1f2rwdQ7pOrdB=nzyQ+4Xq_OKD8aVkYr5pqk@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 19, 2011 at 3:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, 2011-02-18 at 20:45 -0500, Robert Haas wrote:
>> On the other hand, I see no particular
>> harm in leaving the option in there either, though I definitely think
>> the default should be changed to -1.
>
> The default setting should be to *not* freeze up if you lose the
> standby. That behaviour unexpectedly leads to an effective server down
> situation, rather than 2 minutes of slow running.

My understanding is that, with this parameter set, every commit will
wait out the timeout, not just the first one. There's no mechanism to
do anything else. But I did some testing this evening, and it appears
not to work at all. I hit walreceiver with a SIGSTOP and the commit
never completed, even after the two-minute timeout. Also, when I
restarted walreceiver after a long time, I got a server crash.
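
To make the every-commit point concrete, here is a toy model of the
arithmetic. It is not code from the patch: the function name and the
wait_once switch are hypothetical, and it assumes commits are issued one
after another from a single session while the standby never replies.

```python
# Toy model only, not PostgreSQL code. All names here are hypothetical.

def total_stall_seconds(num_commits, timeout_s, wait_once):
    """Seconds spent waiting for a standby that never replies.

    wait_once=False: each commit waits out the full timeout on its own.
    wait_once=True:  only the first commit waits; later commits skip the
                     wait, which would need some state remembering that
                     the standby is gone.
    """
    if num_commits <= 0:
        return 0
    if wait_once:
        return timeout_s
    return num_commits * timeout_s


if __name__ == "__main__":
    # 120 s is the two-minute timeout used in the test described above.
    for n in (1, 10, 100):
        per_commit = total_stall_seconds(n, 120, wait_once=False)
        once = total_stall_seconds(n, 120, wait_once=True)
        print(n, per_commit, once)
```

Under those assumptions the stall grows linearly with the number of
commits, so "2 minutes of slow running" only holds if something stops
later commits from waiting.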

DEBUG: write 0/3027BC8 flush 0/3014690 apply 0/3014690
DEBUG: released 0 procs up to 0/3014690
DEBUG: write 0/3027BC8 flush 0/3027BC8 apply 0/3014690
DEBUG: released 2 procs up to 0/3027BC8
WARNING: could not locate ourselves on wait queue
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: DEBUG:
shmem_exit(-1): 0 callbacks to make
DEBUG: proc_exit(-1): 0 callbacks to make
FATAL: could not receive data from WAL stream: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Failed.
!> LOG: record with zero length at 0/3027BC8
DEBUG: CommitTransaction
DEBUG: name: unnamed; blockState: STARTED; state: INPROGR,
xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 0/3000000
LOG: streaming replication successfully connected to primary
DEBUG: standby "standby" is a potential synchronous standby
DEBUG: write 0/0 flush 0/0 apply 0/3027BC8
DEBUG: released 0 procs up to 0/0
DEBUG: standby "standby" has now caught up with primary
DEBUG: write 0/3027C18 flush 0/0 apply 0/3027BC8
DEBUG: standby "standby" is now the synchronous replication standby
DEBUG: released 0 procs up to 0/0
DEBUG: write 0/3027C18 flush 0/3027C18 apply 0/3027BC8
DEBUG: released 0 procs up to 0/3027C18
DEBUG: write 0/3027C18 flush 0/3027C18 apply 0/3027C18
DEBUG: released 0 procs up to 0/3027C18

(lots more copies of those last two messages)

I believe the problem is that the definition of IsOnSyncRepQueue is
bogus, so that the loop in SyncRepWaitOnQueue always takes the first
branch.

When setting this up, I found it a little confusing that setting only
synchronous_replication did nothing; I also had to set
synchronous_standby_names. We might need a cross-check there. I
believe the docs for synchronous_replication also need some updating;
this part appears to be out of date:

+ between primary and standby. The commit wait will last until the
+ first reply from any standby. Multiple standby servers allow
+ increased availability and possibly increase performance as well.

The words "on the primary" in the next sentence may not be necessary
any more either, as I believe this parameter now has no effect
anywhere else.
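
On the cross-check between synchronous_replication and
synchronous_standby_names suggested above, the rule could look roughly
like the sketch below. This models the check as a standalone function
over a dict of settings; it is not how the server would implement it.
The function name, the accepted boolean spellings, and the assumed
defaults ("off" and an empty list) are all assumptions for illustration.

```python
# Sketch of the cross-check rule only. Hypothetical helper, not server code.

TRUE_SPELLINGS = ("on", "true", "yes", "1")  # assumed boolean spellings


def check_sync_rep_settings(settings):
    """Return a list of warning strings for a dict of setting -> value."""
    warnings = []

    # Assumed defaults: synchronous_replication off, no standby names.
    raw_sync = str(settings.get("synchronous_replication", "off"))
    sync_on = raw_sync.strip().lower() in TRUE_SPELLINGS
    names = str(settings.get("synchronous_standby_names", "")).strip()

    if sync_on and not names:
        warnings.append(
            "synchronous_replication is on but synchronous_standby_names "
            "is empty; no standby can become synchronous"
        )
    return warnings


if __name__ == "__main__":
    for w in check_sync_rep_settings({"synchronous_replication": "on"}):
        print("WARNING:", w)
```

Whether this should be a warning at startup or a hard error is a
separate question; the sketch only shows the condition being tested.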

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
