Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows

From: Andres Freund <andres(at)anarazel(dot)de>
To: buschmann(at)nidsa(dot)net, pgsql-bugs(at)lists(dot)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows
Date: 2019-10-04 20:06:05
Message-ID: 20191004200605.yqcmn75otebwcvyj@alap3.anarazel.de
Lists: pgsql-bugs

Hi,

Thanks for the report!

On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:
> We just moved our production cluster from pg 11.5 to pg 12.0 under windows
> using pg_dump/initdb/pg_restore.
>
> When we activated the replication slots by
>
> SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');
>
> and tried restarting the server, we got a PANIC in error log:
>
> CPS PRD 2019-10-04 19:10:07 CEST 00000 1:> LOG: database system was shut
> down at 2019-10-04 19:10:02 CEST
> CPS PRD 2019-10-04 19:10:07 CEST XX000 2:> PANIC: could not fsync file
> "pg_replslot/sam_repli_3/state": Bad file descriptor
> CPS PRD 2019-10-04 19:10:07 CEST 00000 6:> LOG: startup process (PID
> 4028) was terminated by exception 0xC0000409
> CPS PRD 2019-10-04 19:10:07 CEST 00000 7:> HINT: See C include file
> "ntstatus.h" for a description of the hexadecimal value.
> CPS PRD 2019-10-04 19:10:07 CEST 00000 8:> LOG: aborting startup due to
> startup process failure
> CPS PRD 2019-10-04 19:10:07 CEST 00000 9:> LOG: database system is shut
> down
>
> We use the EDB distribution from the website under Windows Server 2019
> (September 2019 patch level).
>
> select version ();
> version
> ------------------------------------------------------------
> PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
> (1 Zeile)
>
> This seems to me like a fatal bug which makes the streaming replication
> unusable under Windows x64 /pg12.
>
> The same configuration worked flawlessly under pg 11.x until pg 11.5
>
> By searching on google we encountered a similar error from 2015 under pg
> 9.4.1 reported under BUG #13287:
>
> https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org

Uh, Michael? You just reintroduced this bug in

commit 82a5649fb9dbef12d04cd24799be6bf298d889a6
Author: Michael Paquier <michael(at)paquier(dot)xyz>
Date: 2019-03-09 08:50:55 +0900

Tighten use of OpenTransientFile and CloseTransientFile

This fixes two sets of issues related to the use of transient files in
the backend:
1) OpenTransientFile() has been used in some code paths with read-write
flags while read-only is sufficient, so switch those calls to be
read-only where necessary. These have been reported by Joe Conway.

You pretty much entirely reverted:

commit dfbaed459754e71e01bb0cc90a12802bba3f9786
Author: Andres Freund <andres(at)anarazel(dot)de>
Date: 2015-04-28 00:12:38 +0200

Use a fd opened for read/write when syncing slots during startup.

Some operating systems, including the reporter's windows, return EBADFD
or similar when fsync() is invoked on a O_RDONLY file descriptor.
Unfortunately RestoreSlotFromDisk() does exactly that; which causes
failures after restarts in at least some scenarios.

If you hit the bug the error message will be something like
ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor

Simply use O_RDWR instead of O_RDONLY when opening the relevant file
descriptor to fix the bug. Unfortunately I have no way of verifying the
fix, but we've seen similar problems in the past.

This bug goes back to 9.4 where slots were introduced. Backpatch
accordingly.

Reported-By: Patrice Drolet
Bug: #13143:
Discussion: 20150424101006(dot)2556(dot)60897(at)wrigleys(dot)postgresql(dot)org

I realize I perhaps should have added a comment explaining this, but
this is far from the only location that knows we have to open fds
r/w to be able to fsync them.
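
To illustrate the r/w requirement outside of PostgreSQL, here is a minimal
POSIX sketch. It is not the actual slot code, which goes through
OpenTransientFile(); the helper name sync_existing_file() is made up for
this example. It opens an existing file O_RDWR purely so that the
following fsync() is accepted on platforms that reject fsync() on an
O_RDONLY descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not PostgreSQL code: flush an existing file to
 * stable storage. The descriptor is opened O_RDWR rather than O_RDONLY
 * because some platforms fail fsync() on a read-only descriptor with
 * "Bad file descriptor".
 *
 * Returns 0 on success, -1 on failure with errno set.
 */
int
sync_existing_file(const char *path)
{
    int fd;
    int saved_errno;

    assert(path != NULL);

    /* Read/write access is requested only to make fsync() portable. */
    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)
    {
        /* Preserve the fsync() error across the close() call. */
        saved_errno = errno;
        close(fd);
        errno = saved_errno;
        return -1;
    }

    return close(fd);
}
```

A caller would treat a -1 return as a hard error and report errno, much
like the PANIC in the report above does for the slot state file.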

What were you even trying to fix by changing this?

It also seems pretty clear that we need a few buildfarm animals running
with fsync enabled. I'm not sure how best to write test infrastructure
that makes it easy to set that for all tests. I guess I'd best start a
thread about it on -hackers.

Greetings,

Andres Freund
