Re: 7.4.5 losing committed transactions

From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 7.4.5 losing committed transactions
Date: 2004-09-25 02:49:33
Message-ID: 4154DCBD.3090206@Yahoo.com
Lists: pgsql-hackers

On 9/24/2004 10:24 PM, Tom Lane wrote:

> Jan Wieck <JanWieck(at)Yahoo(dot)com> writes:
>> Now the scary thing is that not only did this crash rollback a committed
>> transaction. Another session had enough time in between to receive a
>> NOTIFY and select the data that got rolled back later.
>
> Different session, or same session? NOTIFY is one of the cases that
> would cause the backend to emit messages within the trouble window
> between EndCommand and actual commit. I don't believe that that path
> will do a deliberate pq_flush, but it would be possible that the NOTIFY
> message fills the output buffer and causes the 'C' message to go out
> prematurely.
>
> If you can actually prove that a *different session* was able to see as
> committed data that was not safely committed, then we have another
> problem to look for. I am hoping we have only one nasty bug today ;-)

I do mean *different session*.

My current theory about how the subscriber got out of sync is this:

In Slony, the chunks of serializable replication data are applied in one
transaction, together with the SYNC event and the event's CONFIRM record,
plus a NOTIFY on the confirm relation. The data provider (master or
cascading node) listens on the subscriber's (slave's) confirm relation.
So immediately after the subscriber commits, the provider picks up
the confirm record and knows that the data has propagated and can
be deleted.
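
The handshake described above can be sketched as a toy model. This is not Slony code: the class and method names (Provider, Subscriber, apply_sync, on_confirm) are made up for illustration, and a real transaction is reduced to a single method call.

```python
# Toy model of the confirm handshake; not Slony code, all names hypothetical.

class Provider:
    def __init__(self):
        # sync_id -> replication rows still waiting to be confirmed
        self.log = {}

    def on_confirm(self, sync_id):
        # Stands in for the provider reacting to the NOTIFY on the
        # subscriber's confirm relation: the data has propagated,
        # so the provider considers it safe to delete.
        self.log.pop(sync_id, None)


class Subscriber:
    def __init__(self):
        self.data = []        # applied replication data
        self.confirmed = []   # CONFIRM records

    def apply_sync(self, sync_id, rows, provider):
        # Everything below is meant to be one transaction:
        # replication data + SYNC event + CONFIRM record.
        self.data.extend(rows)
        self.confirmed.append(sync_id)
        # The NOTIFY reaches the listening provider once we commit.
        provider.on_confirm(sync_id)


provider = Provider()
provider.log[1] = ["row-a", "row-b"]

subscriber = Subscriber()
subscriber.apply_sync(1, provider.log[1], provider)
```

After the call, the subscriber holds both rows and the provider no longer keeps SYNC 1 in its log, which is the normal, healthy outcome.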

If the crash now wipes out the committed transaction, the entire SYNC
has to be redone. A problem that will be fixed in 1.0.3 can cause the
replication engine not to restart immediately, and that probably gave
the data provider's cleanup procedure enough time to purge the
replication data. That way it was possible that a direct subscriber was
still in sync, but a cascaded subscriber behind it wasn't. That
constellation automatically ruled out the possibility that the update
wasn't captured on the master. And since the log forwarding is stored
within the same transaction too, the direct subscriber, which had the
correct data, must at that time have had the correct replication log as
well.
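
The failure sequence in this theory can be replayed in a few lines. Again this is only a sketch under stated assumptions, not PostgreSQL or Slony code: CrashySubscriber and its methods are hypothetical, and "visible but not yet durable" is modelled as two plain lists.

```python
# Toy replay of the suspected failure; all names are hypothetical.

class CrashySubscriber:
    def __init__(self):
        self.durable = []   # what survives a crash
        self.visible = []   # what other sessions can already see

    def commit(self, rows):
        # The suspected bug: the commit becomes visible (and its
        # NOTIFY goes out) before it is safely on disk.
        self.visible = self.durable + rows

    def flush(self):
        # Healthy case: this would happen before anyone is told.
        self.durable = list(self.visible)

    def crash(self):
        # Recovery only brings back what was durable.
        self.visible = list(self.durable)


provider_log = {1: ["row-a"]}   # provider still holds SYNC 1
sub = CrashySubscriber()

sub.commit(provider_log[1])     # subscriber applies SYNC 1
provider_log.pop(1)             # provider sees the confirm and purges
sub.crash()                     # crash before the commit was durable

# The subscriber has to redo SYNC 1, but the provider no longer has it.
can_redo = 1 in provider_log
```

With this ordering the subscriber ends up without the rows and can_redo is False; calling flush() before the provider purges would leave the rows in place after the crash.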

I guess nobody ever relied that heavily on data being persistent at the
very microsecond the NOTIFY arrives ...

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #
