Re: 7.4.5 losing committed transactions

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.4.5 losing committed transactions
Date: 2004-09-25 02:49:33
Message-ID: 4154DCBD.3090206@Yahoo.com
Lists: pgsql-hackers

On 9/24/2004 10:24 PM, Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> Now the scary thing is that not only did this crash rollback a committed 
>> transaction. Another session had enough time in between to receive a 
>> NOTIFY and select the data that got rolled back later.
> 
> Different session, or same session?  NOTIFY is one of the cases that
> would cause the backend to emit messages within the trouble window
> between EndCommand and actual commit.  I don't believe that that path
> will do a deliberate pq_flush, but it would be possible that the NOTIFY
> message fills the output buffer and causes the 'C' message to go out
> prematurely.
> 
> If you can actually prove that a *different session* was able to see as
> committed data that was not safely committed, then we have another
> problem to look for.  I am hoping we have only one nasty bug today ;-)

I do mean *different session*.

My current theory about how the subscriber got out of sync is this:

In Slony, each chunk of serializable replication data is applied in one 
transaction, together with the SYNC event, the event's CONFIRM record, 
and a NOTIFY on the confirm relation. The data provider (master or 
cascading node) listens on the subscriber's (slave's) confirm relation. 
So immediately after the subscriber commits, the provider picks up the 
confirm record and then knows that the data has propagated and can be 
deleted.

If the crash now wipes out the committed transaction, the entire SYNC 
has to be redone. A problem that will be fixed in 1.0.3 can cause the 
replication engine not to restart immediately, and that probably gave 
the data provider's cleanup procedure enough time to purge the 
replication data. That way it was possible that a direct subscriber was 
still in sync while a cascaded subscriber behind it wasn't. That 
constellation automatically rules out that the update wasn't captured on 
the master. And since the log forwarding is stored within the same 
transaction too, the direct subscriber, which had the correct data, must 
at that time have had the correct replication log as well.
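
To make the sequence above concrete, here is a toy model of the race in 
Python. It is only a sketch under simplifying assumptions, not Slony or 
backend code: the class and method names (Provider, Subscriber, 
apply_sync, on_confirm, flush, crash) are made up for illustration, and 
the provider's cleanup is modeled as running immediately on the confirm, 
whereas in reality it only had to run before the replication engine 
restarted.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical data provider: keeps log rows until they are confirmed."""
    log: dict = field(default_factory=lambda: {1: "row data for SYNC 1"})

    def on_confirm(self, sync_id):
        # Cleanup: a confirmed SYNC is assumed to have propagated, so purge it.
        self.log.pop(sync_id, None)


@dataclass
class Subscriber:
    """Hypothetical subscriber whose commit is visible before it is durable."""
    durable: dict = field(default_factory=dict)  # survives a crash
    visible: dict = field(default_factory=dict)  # what other sessions can see

    def apply_sync(self, sync_id, provider):
        # One transaction: replication data plus the CONFIRM record.
        self.visible[sync_id] = provider.log[sync_id]
        # The NOTIFY reaches the provider before the commit is durable.
        provider.on_confirm(sync_id)

    def flush(self):
        # Make everything visible so far crash-safe.
        self.durable = dict(self.visible)

    def crash(self):
        # Anything that was not yet durable is lost.
        self.visible = dict(self.durable)


provider = Provider()
subscriber = Subscriber()
subscriber.apply_sync(1, provider)  # provider purges its log on the confirm
subscriber.crash()                  # the commit never became durable

needs_redo = 1 not in subscriber.durable  # SYNC 1 has to be applied again
can_redo = 1 in provider.log              # but is the log data still there?
print(needs_redo, can_redo)
```

In this model the subscriber needs to redo SYNC 1 but the provider has 
already purged the data for it. Calling subscriber.flush() before the 
crash makes the transaction survive, so no redo is needed.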

I guess nobody ever relied that heavily on data to be persistent at the 
microsecond the NOTIFY arrives ...


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck(at)Yahoo(dot)com #
