Re: Inconsistent DB data in Streaming Replication

From: Samrat Revagade <revagade(dot)samrat(at)gmail(dot)com>
To: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, sthomas(at)optionshouse(dot)com, ants(at)cybertec(dot)at, andres(at)2ndquadrant(dot)com
Subject: Re: Inconsistent DB data in Streaming Replication
Date: 2013-04-10 09:53:05
Message-ID: CAF8Q-GyF=vrm+WLHhCLtLtg0skb_LkZwFEWjhRvcG=iybFyzwg@mail.gmail.com
Lists: pgsql-hackers

>> it's one of the reasons why a fresh base backup is required when starting
>> old master as new standby?
>> If yes, I agree with you. I've often heard the complaints about a backup
>> when restarting new standby.
>> That's really big problem.

I think Fujii Masao is on the same page.

> In case of syncrep the master just waits for confirmation from standby
> before returning to client on commit.

> Not just commit, you must stop any *writing* of the wal records,
> effectively killing any parallelism.
> Main issue is that it will make *all* backends dependent on each sync
> commit, essentially serialising all backends' commits, with the
> serialisation *including* the latency of the roundtrip to client. With
> current sync streaming the other backends can continue to write wal;
> with the proposed approach you cannot write any records after the one
> waiting for an ACK from standby.

Let me rephrase the proposal in a more accurate manner:

Consider following scenario:

(1) A client sends the "COMMIT" command to the master server.

(2) The master writes the commit WAL record to disk.

(3) The master writes the data page related to this transaction (e.g., via
checkpoint or bgwriter).

(4) The master sends WAL records continuously to the standby, up to the
commit WAL record.

(5) The standby receives the WAL records, writes them to disk, and then
replies with an ACK.

(6) The master returns a success indication to the client after it receives
the ACK.

If failover happens between (3) and (4), the WAL and DB data on the old
master are ahead of those on the new master. After failover, the new master
runs new transactions independently of the old master, so the WAL and DB
data of the two servers become inconsistent. To resolve this inconsistency,
a backup of the new master has to be taken onto the new standby.
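The window between steps (3) and (4) can be illustrated with a toy model
(illustrative only, not PostgreSQL code; all names here are made up):

```python
# Toy model: a data page hits the old master's disk before the
# corresponding WAL record has been replicated to the standby.

master_wal = []    # WAL records written to the master's disk
master_pages = {}  # data pages on the master's disk
standby_wal = []   # WAL records the standby has received

# (2) master writes the commit WAL record to its own disk
master_wal.append("commit-txn-1")

# (3) checkpoint/bgwriter flushes the data page for txn 1
master_pages["page-A"] = "txn-1 changes"

# Failover happens here, before step (4): the standby never received
# the record, so the old master's WAL and data are ahead of the new
# master's, and the old master cannot safely rejoin as a standby.
old_master_ahead = len(master_wal) > len(standby_wal)
print(old_master_ahead)  # True
```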

But taking a backup is not feasible when the database is large, e.g. several
TB, and must be transferred over a slow WAN.

So, to avoid this type of inconsistency without taking a fresh backup, we
are considering the following:

>> I think that you can introduce GUC specifying whether this extra check
>> is required to avoid a backup when failback.

Approach:

Introduce a new GUC option specifying whether to prevent PostgreSQL from
writing DB data before the corresponding WAL records have been replicated to
the standby. That is, when this GUC option is enabled, PostgreSQL waits for
the corresponding WAL records to be not only written to disk but also
replicated to the standby before writing DB data.

So the process becomes as follows:

(1) A client sends the "COMMIT" command to the master server.

(2) The master writes the commit WAL record to the disk.

(3) The master sends WAL records continuously to the standby, up to the
commit WAL record.

(4) The standby receives the WAL records, writes them to disk, and then
replies with an ACK.

(5) The master then forces a write of the data page related to this
transaction.

(6) The master returns a success indication to the client after it receives
the ACK.

While the master is waiting to force the write of this data page (step 5),
streaming replication continues. Also, other data page writes do not depend
on this particular page write, so data page writes are not serialized.
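The proposed ordering can be sketched with the same toy model (illustrative
only, not PostgreSQL code; the GUC name and variables below are hypothetical):

```python
# Toy model: with the proposed GUC enabled, the master defers the data
# page write until the standby has ACKed the corresponding WAL record.

wait_for_replication = True  # hypothetical GUC

master_wal = []    # WAL on the master's disk
standby_wal = []   # WAL replicated to (and ACKed by) the standby
master_pages = {}  # data pages on the master's disk

master_wal.append("commit-txn-1")   # (2) WAL to the master's disk
standby_wal.append("commit-txn-1")  # (3)+(4) standby receives and ACKs

# (5) the data page may hit disk only once its WAL is on the standby
if not wait_for_replication or "commit-txn-1" in standby_wal:
    master_pages["page-A"] = "txn-1 changes"

# Invariant: every page on the master's disk is backed by WAL the
# standby already holds, so after failover the old master's data is
# never ahead of the new master's.
print("page-A" in master_pages)  # True
```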

Regards,

Samrat
