Quick Links

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date:	2022-04-22 07:37:30
Message-ID:	CA+hUKGKOofj-F2bROOxkLTU4XUjztbizVtyu7YBHx5te9tvvyg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Apr 6, 2022 at 5:07 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> > > The checkpointer never takes heavyweight locks, so the opportunity
> > > you're describing can't arise.
> >
> > <thinks harder> Hmm, oh, you probably meant the buffer interlocking
> > in SyncOneBuffer(). It's true that my most recent patch throws away
> > more requests than it could, by doing the level check at the end of
> > the loop over all buffers instead of adding some kind of
> > DropPendingWritebacks() in the barrier handler. I guess I could find
> > a way to improve that, basically checking the level more often instead
> > of at the end, but I don't know if it's worth it; we're still throwing
> > out an arbitrary percentage of writeback requests.
>
> Doesn't every backend have its own set of pending writebacks?
> BufferAlloc() calls
> ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

Yeah. Alright, for my exaggeration above, s/can't arise/probably
won't often arise/, since when regular backends do this it's for
random dirty buffers, not necessarily stuff they hold a relation lock
on.

I think you are right that if we find ourselves calling smgrwrite() on
another buffer for that relfilenode a bit later, then it's safe to
"unzonk" the relation. In the worst case, IssuePendingWritebacks()
might finish up calling sync_file_range() on a file that we originally
wrote to for generation 1 of that relfilenode, even though we now have
a file descriptor for generation 2 of that relfilenode that was
reopened by a later smgrwrite(). That would be acceptable, because a
bogus sync_file_range() call would be harmless. The key point here is
that we absolutely must avoid re-opening files without careful
interlocking, because that could lead to later *writes* for generation
2 going to the defunct generation 1 file that we opened just as it was
being unlinked.

(For completeness: we know of another way for smgrwrite() to write to
the wrong generation of a recycled relfilenode, but that's hopefully
very unlikely and will hopefully be addressed by adding more bits and
killing off OID-wraparound[1]. In this thread we're concerned only
with these weird "explicit OID reycling" cases we're trying to fix
with PROCSIGNAL_BARRIER_SMGRRELEASE sledgehammer-based cache
invalidation.)

My new attempt, attached, is closer to what Andres proposed, except at
the level of md.c segment objects instead of level of SMgrRelation
objects. This avoids the dangling SMgrRelation pointer problem
discussed earlier, and is also a little like your "zonk" idea, except
instead of a zonk flag, the SMgrRelation object is reset to a state
where it knows nothing at all except its relfilenode. Any access
through smgrXXX() functions *except* smgrwriteback() will rebuild the
state, just as you would clear the hypothetical zonk flag.

So, to summarise the new patch that I'm attaching to this email as 0001:

1. When PROCSIGNAL_BARRIER_SMGRRELEASE is handled, we call
smgrreleaseall(), which calls smgrrelease() on all SMgrRelation
objects, telling them to close all their files.

2. All SMgrRelation objects are still valid, and any pointers to them
that were acquired before a CFI() and then used in an smgrXXX() call
afterwards will still work (the example I know of being
RelationCopyStorage()[2]).

3. mdwriteback() internally uses a new "behavior" flag
EXTENSION_DONT_OPEN when getting its hands on the internal segment
object, which says that we should just give up immediately if we don't
already have the file open (concretely: if
PROCSIGNAL_BARRIER_SMGRRELEASE came along between our recent
smgrwrite() call and the writeback machinery's smgrwriteback() call,
it'll do nothing at all).

Someone might say that it's weird that smgrwriteback() has that
behaviour internally, and we might eventually want to make it more
explicit by adding a "behavior" argument to the function itself, so
that it's the caller that controls it. It didn't seem worth it for
now though; the whole thing is a hint to the operating system anyway.

However it seems that I have something wrong, because CI is failing on
Windows; I ran out of time for looking into that today, but wanted to
post what I have so far since I know we have an open item or two to
close here ASAP...

Patches 0002-0004 are Andres's, with minor tweaks:

v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:

It seems useless to even keep the macro USE_BARRIER_SMGRRELEASE if
we're going to define it always, so I removed it. I guess it's useful
to be able to disable that logic easily to see that the assertion in
the other patch fails, but you can do that by commenting out a line in
ProcessBarrierSmgrRelease().

I'm still slightly confused about whether we need an *extra* global
barrier in DROP TABLESPACE, not just if destroy_tablespace_directory()
failed.

v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch

Was missing:

-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm

v3-0004-WIP-AssertFileNotDeleted-fd.patch

+ * XXX: Figure out which operating systems this works on.

Do we want this to be wrapped in some kind of macros that vanishes
into thin air unless you enable it with USE_UNLINKED_FILE_CHECKING or
something? Then we wouldn't care so much if it doesn't work on some
unusual system. Bikeshed mode: I would prefer "not unlinked", since
that has a more specific meaning than "not deleted".

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com#1070c79256f2330ec52f063cdbe2add0
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBfaZ2puQDYV6h2oSWO2QW21_JOXZpha65gWRcmGNCZA%40mail.gmail.com#5deeb3d8adae4daf0da1f09e509eef56

Attachment	Content-Type	Size
v3-0001-Rethink-PROCSIGNAL_BARRIER_SMGRRELEASE.patch	text/x-patch	6.5 KB
v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch	text/x-patch	4.7 KB
v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch	text/x-patch	8.6 KB
v3-0004-WIP-AssertFileNotDeleted-fd.patch	text/x-patch	6.7 KB

In response to

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: at 2022-04-05 17:07:12 from Robert Haas

Responses

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: at 2022-05-03 18:36:11 from Robert Haas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Magnus Hagander	2022-04-22 07:49:34	Re: Estimating HugePages Requirements?
Previous Message	Yugo NAGATA	2022-04-22 05:58:01	Re: Implementing Incremental View Maintenance