BUG #15438: Standby corruption after "Too many open files in system" error

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: juanjo(dot)santamaria(at)gmail(dot)com
Subject: BUG #15438: Standby corruption after "Too many open files in system" error
Date: 2018-10-18 18:36:42
Message-ID: 15438-bbe4679a595d795f@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15438
Logged by: Juanjo Santamaría
Email address: juanjo(dot)santamaria(at)gmail(dot)com
PostgreSQL version: 10.4
Operating system: Redhat 7.3
Description:

After some time working under poor conditions, the open file limit was too
low for the postgres user, the streaming replica ended in a corrupted
state.

The primary was working the whole time and recreating the standby replica,
after configuring the user limits, seems to solve the issue.

The log is quite verbose, so I have tried to summarize at least one
occurrency for every error:

Oct 15 10:47:54 postgres[9507]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:48:38 postgres[9507]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:48:38 postgres[40098]: [15-1] LOG: record with incorrect
prev-link 6/A1A993C8 at 6/BEA99400
Oct 15 10:48:38 postgres[9550]: [9-1] FATAL: could not connect to the
primary server: server closed the connection unexpectedly
Oct 15 10:48:38 postgres[9550]: [9-2] #011#011This probably means the server
terminated abnormally
Oct 15 10:48:38 postgres[9550]: [9-3] #011#011before or while processing the
request.

Oct 15 10:48:53 postgres[9586]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:49:32 postgres[9586]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open f
iles in system
Oct 15 10:49:32 postgres[40098]: [16-1] LOG: invalid record length at
6/BEABD0F8: wanted 24, got 0

Oct 15 10:50:31 postgres[9765]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:50:39 postgres[9765]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:50:39 postgres[40098]: [19-1] LOG: invalid resource manager ID
101 at 6/BEAE3708
Oct 15 10:51:01 postgres[9767]: [9-1] FATAL: could not connect to the
primary server: server closed the connection unexpectedly
Oct 15 10:51:01 postgres[9767]: [9-2] #011#011This probably means the server
terminated abnormally
Oct 15 10:51:01 postgres[9767]: [9-3] #011#011before or while processing the
request.

Oct 15 10:51:06 postgres[9783]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 10:51:39 postgres[9783]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 10:51:39 postgres[40098]: [20-1] LOG: record with incorrect
prev-link 3E/746F6F72 at 6/BEC3D5E8

Oct 15 11:03:39 postgres[10850]: [9-1] FATAL: could not connect to the
primary server: FATAL: could not open relation mapping file
"global/pg_filenode.map": Too many open files in system

Oct 15 12:00:10 postgres[40101]: [3732-1] LOG: restartpoint complete: wrote
61 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 re
cycled; write=6.159 s, sync=0.018 s, total=6.223 s; sync files=16,
longest=0.006 s, average=0.001 s; distance=451 kB, estimate=2215 kB
Oct 15 12:01:40 postgres[15182]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open
files in system
Oct 15 12:01:40 postgres[40098]: [80-1] LOG: invalid resource manager ID 64
at 6/BEE34F00
Oct 15 12:01:42 postgres[15711]: [9-1] LOG: started streaming WAL from
primary at 6/BE000000 on timeline 3
Oct 15 12:05:38 postgres[15711]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: epoll_create1 failed: Too many open files in system
Oct 15 12:05:38 postgres[40098]: [81-1] LOG: unexpected pageaddr 6/A1E6E000
in log segment 0000000300000006000000BE, offset 15130624

Oct 16 01:34:15 postgres[18890]: [9-1] LOG: started streaming WAL from
primary at 6/C3000000 on timeline 3
Oct 16 01:34:15 postgres[18890]: [10-1] FATAL: could not receive data from
WAL stream: ERROR: could not open file "pg_wal/0000000300000006000000C3":
Too many open files in system

Oct 16 17:58:44 postgres[38030]: [9-1] FATAL: the database system is
starting up
Oct 16 17:58:47 postgres[38029]: [9-1] LOG: entering standby mode
Oct 16 17:58:47 postgres[38029]: [10-1] LOG: redo starts at 6/CF9B3598
Oct 16 17:58:47 postgres[38029]: [11-1] LOG: consistent recovery state
reached at 6/CF9C53B8
Oct 16 17:58:47 postgres[38026]: [8-1] LOG: database system is ready to
accept read only connections
Oct 16 17:58:48 postgres[38029]: [12-1] FATAL: invalid memory alloc request
size 1715044352
Oct 16 17:58:48 postgres[38026]: [9-1] LOG: startup process (PID 38029)
exited with exit code 1
Oct 16 17:58:48 postgres[38026]: [10-1] LOG: terminating any other active
server processes
Oct 16 17:58:48 postgres[38026]: [11-1] LOG: database system is shut down

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2018-10-18 22:39:34 Re: NOTIFY does not work as expected
Previous Message Euler Taveira 2018-10-18 16:52:49 Re: parallel queries hang