Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...

From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'full_page_writes=off' , VACUUM and crashing streaming slaves...
Date: 2012-10-07 00:58:24
Message-ID: EFE82544-2AA5-4899-9FA3-F092D26BF3D4@chittenden.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

>> While the master was still being setup and data was being loaded, the
>> system ran out of open file descriptors and Pg crashed a few times
>> until this was realized and corrected.
>
> Could you provide a bit more detail than that? Usually the behavior we
> see under out-of-FDs conditions is that everything else on the system
> starts to fall over badly, but PG keeps soldiering along to some extent.
> I would not expect data corruption to result, anyway.

Yup, I ran out of time earlier and wasn't able to dig them out, but I just recovered them. It looks like there are two error modes that I ran in to as a result of running out of available FDs.

> If you've got the postmaster logs from this episode, it would be useful
> to see what complaints got logged.

These failures are against 9.2.1, btw.

The first crash scenario:

Oct 5 15:00:24 db01 postgres[75852]: [6449-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT /* query */ FROM tbl AS this_ WHERE this_.user_id=$1
Oct 5 15:00:24 db01 postgres[75852]: [6456-1] javafail(at)dbcluster 75852 0: ERROR: could not seek to end of file "base/16387/20013": Too many open files in system
[snip - lots of could not seek to end of file errors. How does seek(2) consume a file descriptor??? ]
Oct 5 15:00:24 db01 postgres[75852]: [6480-1] javafail(at)dbcluster 75852 0: ERROR: could not open file "base/16387/20275": Too many open files in system
Oct 5 15:00:24 db01 postgres[75852]: [6480-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT /* some other query */ FROM tbl2 AS this_ WHERE this_.user_id=$1 limit $2
Oct 5 15:00:25 db01 postgres[75852]: [6481-1] javafail(at)dbcluster 75852 0: ERROR: current transaction is aborted, commands ignored until end of transaction block
Oct 5 15:00:25 db01 postgres[75852]: [6481-2] javafail(at)dbcluster 75852 0: STATEMENT: SELECT 1
Oct 5 15:00:25 db01 postgres[76648]: [5944-1] javafail(at)dbcluster 76648 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:00:25 db01 postgres[76649]: [5944-1] javafail(at)dbcluster 76649 0: FATAL: pipe() failed: Too many open files in system
[ snip - this makes sense, pipe(2)'s able to return ENFILE ]
Oct 5 15:00:25 db01 postgres[76407]: [5894-1] javafail(at)dbcluster 76407 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76312]: [6295-1] javafail(at)dbcluster 76312 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76331]: [8612-1] javafail(at)dbcluster 76331 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76407]: [5894-2] javafail(at)dbcluster 76407 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:00:25 db01 postgres[76312]: [6295-2] javafail(at)dbcluster 76312 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:00:25 db01 postgres[76315]: [7317-1] javafail(at)dbcluster 76315 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76407]: [5894-3] javafail(at)dbcluster 76407 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.
Oct 5 15:00:25 db01 postgres[76302]: [10249-1] javafail(at)dbcluster 76302 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76331]: [8612-2] javafail(at)dbcluster 76331 0: DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:00:25 db01 postgres[76286]: [7471-1] javafail(at)dbcluster 76286 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76307]: [7628-1] javafail(at)dbcluster 76307 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76285]: [7034-1] javafail(at)dbcluster 76285 0: WARNING: terminating connection because of crash of another server process
Oct 5 15:00:25 db01 postgres[76331]: [8612-3] javafail(at)dbcluster 76331 0: HINT: In a moment you should be able to reconnect to the database and repeat your command.
[snip]
Oct 5 15:00:28 db01 postgres[76875]: [5949-1] javafail(at)dbcluster 76875 0: FATAL: the database system is in recovery mode
[snip]
Oct 5 15:01:28 db01 postgres[76888]: [5951-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5953-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5955-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5957-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
Oct 5 15:01:28 db01 postgres[76888]: [5959-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too many open files in system
[snip]
Oct 5 15:01:31 db01 postgres[76888]: [6503-1] @ 76888 0: WARNING: could not open statistics file "pg_stat_tmp/pgstat.stat": Too m
any open files in system
Oct 5 15:01:31 db01 postgres[77571]: [5978-1] @ 77571 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5979-1] javafail(at)dbcluster 77572 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:01:31 db01 postgres[77572]: [5980-1] javafail(at)dbcluster 77572 0: WARNING: terminating connection because of crash of anothe
r server process
Oct 5 15:01:31 db01 postgres[77572]: [5980-2] javafail(at)dbcluster 77572 0: DETAIL: The postmaster has commanded this server process t
o roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Oct 5 15:01:31 db01 postgres[77572]: [5980-3] javafail(at)dbcluster 77572 0: HINT: In a moment you should be able to reconnect to the d
atabase and repeat your command.
Oct 5 15:01:31 db01 postgres[77544]: [7490-1] javafail(at)dbcluster 77544 0: WARNING: terminating connection because of crash of anothe
r server process
[snip]
Oct 5 15:01:34 db01 postgres[77811]: [5983-1] javafail(at)dbcluster 77811 0: FATAL: the database system is in recovery mode
Oct 5 15:02:20 db01 postgres[78473]: [5987-1] javafail(at)dbcluster 78473 0: FATAL: pipe() failed: Too many open files in system
Oct 5 15:02:20 db01 postgres[78466]: [5989-1] javafail(at)dbcluster 78466 0: WARNING: terminating connection because of crash of another server process

What's odd to me is not the failure scenarios that come from a system running out of FDs (though seek(2)'ing consuming an FD seems odd), it's more that it's still possible for a master DB's VACUUM to clean up a bogus or partial page write, and have the slave crash when the WAL entry is shipped over.

Are there any other bits of info that you want me to hold on to, extract, or things you want me to test? -sc

--
Sean Chittenden
sean(at)chittenden(dot)org

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Andreas Kretschmer 2012-10-07 05:46:51 Re: Help with query timeout
Previous Message Jan Nielsen 2012-10-06 20:47:48 Re: strange hot_standby behaviour