Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Benedikt Grundmann'" <bgrundmann(at)janestreet(dot)com>, "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>
Cc: "'PostgreSQL-Dev'" <pgsql-hackers(at)postgresql(dot)org>, "'David Powers'" <dpowers(at)janestreet(dot)com>
Subject: Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date: 2013-05-15 05:56:40
Message-ID: 004f01ce5130$f9077a60$eb166f20$@kapila@huawei.com
Lists: pgsql-hackers

On Tuesday, May 14, 2013 7:19 PM Benedikt Grundmann wrote:
> It's on the production database and the streaming replica.  But not on the
> snapshot.

> production
> -rw------- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3

> streaming replica
> -rw------- 1 postgres postgres 312778752 May 13 23:50 /database/postgres/base/16416/291498116.3
> Is there a way to find out what the file contains?

You can try the pageinspect module in contrib.
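
If it is easier to look at the file without going through a running
server, the page header can also be decoded straight from the relation
file. The following is only a rough sketch, not what pageinspect does
internally. It assumes the default 8 kB block size, a little-endian host,
and the usual 24-byte page header layout; the names parse_page_header and
read_block are made up for this example.

```python
import struct

BLCKSZ = 8192  # assumption: default PostgreSQL block size

# Assumed 24-byte page header, little-endian:
#   pd_lsn (two uint32: high and low half), pd_tli (pd_checksum in later
#   releases), pd_flags, pd_lower, pd_upper, pd_special,
#   pd_pagesize_version (all uint16), pd_prune_xid (uint32)
HEADER = struct.Struct("<IIHHHHHHI")


def parse_page_header(page):
    """Decode the header fields from the first 24 bytes of a page."""
    (lsn_hi, lsn_lo, tli_or_checksum, flags, lower, upper,
     special, size_version, prune_xid) = HEADER.unpack_from(page, 0)
    return {
        "lsn": "%X/%X" % (lsn_hi, lsn_lo),
        "tli_or_checksum": tli_or_checksum,
        "flags": flags,
        "lower": lower,
        "upper": upper,
        "special": special,
        "page_size": size_version & 0xFF00,      # high byte holds the size
        "layout_version": size_version & 0x00FF,  # low byte holds the version
        "prune_xid": prune_xid,
    }


def read_block(path, blkno=0):
    """Read block number blkno from one relation segment file."""
    with open(path, "rb") as f:
        f.seek(blkno * BLCKSZ)
        page = f.read(BLCKSZ)
    if len(page) < HEADER.size:
        raise ValueError("block %d is missing or truncated" % blkno)
    return parse_page_header(page)
```

Calling read_block() on a copy of the relation file and looking at the
"lsn" field shows which LSN is stamped on that page, which can then be
compared against the LSNs reported in the server log.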

> We just got some more information.  All of the following was done / seen
> in the logs of the snapshot database.

> After we saw this we ran a vacuum full on the table we suspect to be
> backed by this file.  This happened:

>WARNING:  concurrent insert in progress within table "js_equity_daily_diff"

> 2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> And after that these started appearing in the logs (and they get repeated
> every second now):

> [root(at)nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv  | tail -n 2
> 2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> There are no earlier ERROR's in the logs.
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
> 2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of base/16416/291498116","Multiple failures --- write error might be permanent.",,,,,,,,""

> The disk is not full nor are there any messages in the kernel logs.

The reason for this error is that the system is not able to flush XLOG up to
the requested point; most likely, the requested flush point is past the end
of XLOG. This has been seen to occur when a disk page has a corrupted LSN.
(I am quoting this from the comment in the code where the above error
message is raised.)
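
To make the comparison concrete, here is a small sketch that checks the
two LSNs from the log lines above. It is not the server's code; it only
treats an LSN as a (high, low) pair of hexadecimal numbers and compares
the pairs, and the function names are made up for this example.

```python
def parse_lsn(text):
    """Split an LSN such as '1D08/9B57FCD0' into a (high, low) integer pair."""
    hi, lo = text.split("/")
    return (int(hi, 16), int(lo, 16))


def flush_request_satisfied(requested, flushed):
    """True if XLOG has been flushed at least up to the requested LSN."""
    # Tuples compare element by element, so the high half decides first.
    return parse_lsn(requested) <= parse_lsn(flushed)


# Values taken from the log lines quoted above.
requested = "1D08/9B57FCD0"
flushed = "1CEE/31266090"
print(flush_request_satisfied(requested, flushed))
```

With these values the request is not satisfied: the high half of the
requested LSN (0x1D08) is already larger than the high half of the flushed
position (0x1CEE), so the page claims an LSN well beyond what this
instance has flushed.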

So if XLOG cannot be flushed up to that point, the checkpointer will not
write out the data of file 291498116 either.

It seems to me that the database where these errors are observed is
corrupted.

With Regards,
Amit Kapila.
