Re: Cygwin cleanup

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Cygwin cleanup
Date: 2023-09-14 23:04:23
Message-ID: CA+hUKG+J4jSFk=-hdoZdcx+p7ru6xuipzCZY-kiKoDc2FjsV7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 8, 2023 at 8:06 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > My patch used fsync_fname_ext() which would cause an ERROR rather than a
> > PANIC when failing to fsync the logical decoding pathname.
>
> FTR While analysing a lot of CI logs trying to debug something else I
> came across a plain Windows/MSVC (not Cygwin) build that panicked like
> this:
>
> https://cirrus-ci.com/task/6689224833892352
> https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log
> https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt

Here are some more flapping CI failures due to this phenomenon
(nothing to do with Cygwin, this is just regular Windows):

4509011781877760 | Windows - Server 2019, VS 2019 - Meson & ninja
4525770962370560 | Windows - Server 2019, VS 2019 - Meson & ninja
5664518341132288 | Windows - Server 2019, VS 2019 - Meson & ninja
5689846694412288 | Windows - Server 2019, VS 2019 - Meson & ninja
5853025126842368 | Windows - Server 2019, VS 2019 - Meson & ninja
6639943179567104 | Windows - Server 2019, VS 2019 - Meson & ninja
6727728217456640 | Windows - Server 2019, VS 2019 - Meson & ninja
6740158104469504 | Windows - Server 2019, VS 2019 - Meson & ninja

They all say something like 'PANIC: could not open file
"pg_logical/snapshots/0-1597938.snap": No such file or directory',
because they all do rename(some_temporary_file, that_name), then try
to re-open and sync it, but rename() on Windows fails to be atomic so
a concurrent process can see an intermediate ENOENT state. I see a
few 'local' workarounds we could do to fix that, but ... there seems
to be a much better idea staring us in the face in the comments!

I think this would be fixed as a happy by-product of this TODO in
SnapBuildSerialize():

* TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
* some noticeable overhead since it's performed synchronously during
* decoding?

I have done no analysis myself of whether that is sound, but assuming
it is, I think the way to achieve that is to tweak FileTag so that it
can describe the file to be fsync'd, and use the sync.c machinery to
fsync the file in the background. Presumably that would provide a
huge speed up for logical decoding, and people would rejoice.

Some other topics that came up in this thread:
* Now that PostgreSQL seems to be stable enough on Cygwin to get
through the basic regression tests reliably, lorikeet might as well
run the full TAP test suite?
* Justin complained about the weird effects of wal_sync_method, and I
finally got around to showing how I think that should be untangled, in
https://commitfest.postgresql.org/44/4453/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2023-09-14 23:26:47 Re: Buffer ReadMe Confuse
Previous Message Peter Smith 2023-09-14 22:53:26 Re: subscription TAP test has unused $result