Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date: 2022-02-10 18:49:50
Message-ID: CA+TgmoY2+iFQeDUjQyCqJmFpztyY5TuZkGKToWb=As=_YgEDfg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 9, 2022 at 5:00 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> The problem starts with
>
> commit aa01051418f10afbdfa781b8dc109615ca785ff9
> Author: Robert Haas <rhaas(at)postgresql(dot)org>
> Date: 2022-01-24 14:23:15 -0500
>
> pg_upgrade: Preserve database OIDs.

Well, that's sad.

> I think the most realistic way to address this is the mechanism is to use the
> mechanism from
> https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

I agree. While I feel sort of bad about missing this issue in review,
I also feel like it's pretty surprising that there isn't something
plugging this hole already. It feels unexpected that our FD management
layer might hand you an FD that references the previous file that had
the same name instead of the one that exists right now, and I don't
think it's too much of a stretch of the imagination to suppose that
this won't be the last problem we have if we don't fix it. I also
agree that the mechanism proposed in that thread is the best way to
fix the problem. The sinval mechanism won't work, since there's
nothing to ensure that the invalidation messages are processed in a
sufficient timely fashion, and there's no obvious way to get around
that problem. But the ProcSignalBarrier mechanism is not intertwined
with heavyweight locking in the way that sinval is, so it should be a
good fit for this problem.

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

> If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

> We perhaps could avoid all relfilenode recycling, issues on the fd level, by
> adding a barrier whenever relfilenode allocation wraps around. But I'm not
> sure how easy that is to detect.

I definitely think it would be nice to cover this issue not only for
database OIDs and tablespace OIDs but also for relfilenode OIDs.
Otherwise we're right on the cusp of having the same bug in that case.
pg_upgrade doesn't drop and recreate tables the way it does databases,
but if somebody made it do that in the future, would they think about
this? I bet not.

Dilip's been working on your idea of making relfilenode into a 56-bit
quantity. That would make the problem of detecting wraparound go away.
In the meantime, I suppose we could do think of trying to do something
here:

/* wraparound, or first post-initdb
assignment, in normal mode */
ShmemVariableCache->nextOid = FirstNormalObjectId;
ShmemVariableCache->oidCount = 0;

That's a bit sketch, because this is the OID counter, which isn't only
used for relfilenodes. I'm not sure whether there would be problems
waiting for a barrier here - I think we might be holding some lwlocks
in some code paths.

> I think we might actually be able to remove the checkpoint from dropdb() by
> using barriers?

That looks like it might work. We'd need to be sure that the
checkpointer would both close any open FDs and also absorb fsync
requests before acknowledging the barrier.

> I think we need to add some debugging code to detect "fd to deleted file of
> same name" style problems. Especially during regression tests they're quite
> hard to notice, because most of the contents read/written are identical across
> recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent
instrumentation. But I'm also not violently opposed to it.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2022-02-10 18:59:39 Re: can we add subscription TAP test option "vcregress subscriptioncheck" for MSVC builds?
Previous Message Jeff Davis 2022-02-10 18:37:54 Re: Per-table storage parameters for TableAM/IndexAM extensions