| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | buschmann(at)nidsa(dot)net, pgsql-bugs(at)lists(dot)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> | 
| Subject: | Re: BUG #16039: PANIC when activating replication slots in Postgres 12.0 64bit under Windows | 
| Date: | 2019-10-04 20:06:05 | 
| Message-ID: | 20191004200605.yqcmn75otebwcvyj@alap3.anarazel.de | 
| Lists: | pgsql-bugs | 
Hi,
Thanks for the report!
On 2019-10-04 19:28:28 +0000, PG Bug reporting form wrote:
> We just moved our production cluster from pg 11.5 to pg 12.0 under windows
> using pg_dump/initdb/pg_restore.
>
> When we activated the replication slots by
>
> SELECT * FROM pg_create_physical_replication_slot('sam_repli_3');
>
> and tried restarting the server, we got a PANIC in error log:
>
> CPS PRD 2019-10-04 19:10:07 CEST  00000  1:> LOG:  database system was shut
> down at 2019-10-04 19:10:02 CEST
> CPS PRD 2019-10-04 19:10:07 CEST  XX000  2:> PANIC:  could not fsync file
> "pg_replslot/sam_repli_3/state": Bad file descriptor
> CPS PRD 2019-10-04 19:10:07 CEST  00000  6:> LOG:  startup process (PID
> 4028) was terminated by exception 0xC0000409
> CPS PRD 2019-10-04 19:10:07 CEST  00000  7:> HINT:  See C include file
> "ntstatus.h" for a description of the hexadecimal value.
> CPS PRD 2019-10-04 19:10:07 CEST  00000  8:> LOG:  aborting startup due to
> startup process failure
> CPS PRD 2019-10-04 19:10:07 CEST  00000  9:> LOG:  database system is shut
> down
>
> We use the EDB distribution from the website under Windows Server 2019
> (September 2019 patch level).
>
> select version ();
>                           version
> ------------------------------------------------------------
>  PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
> (1 Zeile)
>
> This seems to me like a fatal bug which makes the streaming replication
> unusable under Windows x64 /pg12.
>
> The same configuration worked flawlessly under pg 11.x until pg 11.5
>
> By searching on google we encountered a similar error from 2015 under pg
> 9.4.1 reported under BUG #13287:
>
> https://www.postgresql.org/message-id/flat/20150514105514.2691.67352%40wrigleys.postgresql.org
Uh, Michael? You just reintroduced this bug in
commit 82a5649fb9dbef12d04cd24799be6bf298d889a6
Author: Michael Paquier <michael(at)paquier(dot)xyz>
Date:   2019-03-09 08:50:55 +0900
Tighten use of OpenTransientFile and CloseTransientFile
    This fixes two sets of issues related to the use of transient files in
    the backend:
    1) OpenTransientFile() has been used in some code paths with read-write
    flags while read-only is sufficient, so switch those calls to be
    read-only where necessary.  These have been reported by Joe Conway.
You pretty much entirely reverted:
commit dfbaed459754e71e01bb0cc90a12802bba3f9786
Author: Andres Freund <andres(at)anarazel(dot)de>
Date:   2015-04-28 00:12:38 +0200
Use a fd opened for read/write when syncing slots during startup.
    Some operating systems, including the reporter's windows, return EBADFD
    or similar when fsync() is invoked on a O_RDONLY file descriptor.
    Unfortunately RestoreSlotFromDisk() does exactly that; which causes
    failures after restarts in at least some scenarios.
    If you hit the bug the error message will be something like
    ERROR: could not fsync file "pg_replslot/$name/state": Bad file descriptor
    Simply use O_RDWR instead of O_RDONLY when opening the relevant file
    descriptor to fix the bug.  Unfortunately I have no way of verifying the
    fix, but we've seen similar problems in the past.
    This bug goes back to 9.4 where slots were introduced. Backpatch
    accordingly.
    Reported-By: Patrice Drolet
    Bug: #13143:
    Discussion: 20150424101006(dot)2556(dot)60897(at)wrigleys(dot)postgresql(dot)org
I realize I perhaps should have added a comment explaining this, but
this is far from the only location that knows we have to open fds
r/w to be able to fsync them.
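For illustration, here is a minimal POSIX-style sketch of the pattern the 2015 commit describes: open the file O_RDWR even though nothing gets written, so that the later fsync() is accepted on platforms that reject it on an O_RDONLY descriptor. The helper name sync_existing_file() and its error messages are made up for this example. This is not the RestoreSlotFromDisk() code, which goes through OpenTransientFile() and the backend's own error reporting.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): flush an existing file to disk.
 *
 * The file is opened O_RDWR although we never write to it, because some
 * platforms fail fsync() with "Bad file descriptor" when the descriptor
 * was opened O_RDONLY.
 *
 * Returns 0 on success, -1 on failure with errno describing the failure.
 */
int
sync_existing_file(const char *path)
{
    int fd;
    int saved_errno;

    fd = open(path, O_RDWR);    /* deliberately not O_RDONLY */
    if (fd < 0)
    {
        saved_errno = errno;
        fprintf(stderr, "could not open file \"%s\": %s\n",
                path, strerror(saved_errno));
        errno = saved_errno;
        return -1;
    }

    if (fsync(fd) != 0)
    {
        saved_errno = errno;
        fprintf(stderr, "could not fsync file \"%s\": %s\n",
                path, strerror(saved_errno));
        close(fd);
        errno = saved_errno;
        return -1;
    }

    if (close(fd) != 0)
        return -1;

    return 0;
}
```

Calling sync_existing_file("pg_replslot/sam_repli_3/state") would exercise the same open-then-fsync sequence on the file from the report. Whether an O_RDONLY descriptor actually fails depends on the platform, so a test on Linux alone would not have caught the regression.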
What were you even trying to fix by changing this?
Seems also pretty clear that we need a few animals running with fsync
enabled. Not sure how we best can write test infrastructure to make it
easy to set that for all tests. Guess I best start a thread about it on
-hackers.
Greetings,
Andres Freund