Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: David Powers <dpowers(at)janestreet(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-29 20:59:03
Message-ID: CAJpcCMjAZ7r0Tbs2f9gwtjN573GON4WcE4eeu1UqDmKYyDKpPQ@mail.gmail.com
Lists: pgsql-hackers

It's another possibility, but I think it's still somewhat remote given how
long we've been using this method with this code. It's sadly hard to test
because taking the full backup without the hard linking is fairly expensive
(the databases comprise multiple terabytes).

As a possibly unsatisfying solution, I've spent the last day reworking the
backups to use the low-level API and pg_basebackup, which takes the
snapshots and the streaming replica out of the picture entirely.

-David

On Tue, May 28, 2013 at 7:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, May 28, 2013 at 10:53 AM, Benedikt Grundmann
> <bgrundmann(at)janestreet(dot)com> wrote:
> > Today we have seen
> >
> > 2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10
> > EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> > flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> > base/16416/293974676",,,,""
> > 2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10
> > EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied ---
> > flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation
> > base/16416/293974676",,,,""
> >
> > while taking the backup of the primary. We have been running like that
> > for a few days, and today is the first day we have seen these problems
> > again. So it's not entirely deterministic, and we don't yet know what we
> > have to do to reproduce it.
> >
> > So this makes Robert's theory more likely. However, we have also been
> > using this method (LVM + rsync with hardlinks from primary) for years
> > without these problems. So the big question is: what changed?
>
> Well... I don't know. But my guess is there's something wrong with
> the way you're using hardlinks. Remember, a hardlink means two
> logical pointers to the same file on disk. So if either file gets
> modified after the fact, then the other pointer is going to see the
> changes. The xlog flush request not satisfied stuff could happen if,
> for example, the backup is pointing to a file, and the primary is
> pointing to the same file, and the primary modifies the file after the
> backup is taken (thus modifying the backup after-the-fact).
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
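The hard-link behaviour Robert describes can be reproduced outside PostgreSQL. The following is a minimal sketch, not the backup procedure discussed in this thread: the file names are hypothetical stand-ins for a relation file and its hard-linked backup copy, and it assumes a filesystem that supports hard links. It shows that an in-place write through one name is visible through the other, while replacing a file via rename leaves the other name untouched.

```python
import os
import tempfile


def hardlink_demo():
    """Return (same_inode, backup_after_inplace_write, backup_after_rename)."""
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical names: one path "owned" by the primary, one by the backup.
        primary = os.path.join(d, "relfile_primary")
        backup = os.path.join(d, "relfile_backup")

        with open(primary, "wb") as f:
            f.write(b"block v1")

        # Create the hard link: two directory entries, one inode.
        os.link(primary, backup)
        same_inode = os.stat(primary).st_ino == os.stat(backup).st_ino

        # Modify the file in place through the primary's name.
        with open(primary, "r+b") as f:
            f.write(b"block v2")
        with open(backup, "rb") as f:
            after_inplace = f.read()  # the "backup" sees the new contents

        # Replace the primary by writing a new file and renaming it over
        # the old name. This creates a new inode; the link is broken.
        tmp = primary + ".tmp"
        with open(tmp, "wb") as f:
            f.write(b"block v3")
        os.replace(tmp, primary)
        with open(backup, "rb") as f:
            after_rename = f.read()  # the "backup" still holds the old inode

        return same_inode, after_inplace, after_rename


if __name__ == "__main__":
    print(hardlink_demo())
```

Whether anything in the setup described here actually writes in place to a linked file is exactly the open question; the sketch only shows why it would matter if it did.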

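For readers unfamiliar with the error quoted above, the two positions in "xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0" can be compared numerically. This is a rough sketch that assumes the two hexadecimal halves can be treated as the high and low 32 bits of a single position. That is sufficient for ordering, but on 9.2 the byte distance derived this way is only approximate. The helper name `parse_lsn` is made up for illustration.

```python
def parse_lsn(text):
    """Convert an 'XXXX/YYYYYYYY' WAL position into one comparable integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Values copied from the log lines quoted in this thread.
requested = parse_lsn("1E95/AFB2DB10")  # position the block being written refers to
flushed = parse_lsn("1E7E/21CB79A0")    # how far WAL had actually been flushed

# The block refers to WAL well beyond what this server has flushed.
is_ahead = requested > flushed
high_word_gap = (requested - flushed) >> 32

if __name__ == "__main__":
    print(is_ahead, high_word_gap)
```

A block that refers to WAL far ahead of the flushed position is consistent with the scenario described above, where a data file changes after the backup was taken.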