Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date: 2022-02-10 22:14:08
Message-ID: 20220210221408.b3joflmr4x5opipn@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2022-02-10 13:49:50 -0500, Robert Haas wrote:
> I agree. While I feel sort of bad about missing this issue in review,
> I also feel like it's pretty surprising that there isn't something
> plugging this hole already. It feels unexpected that our FD management
> layer might hand you an FD that references the previous file that had
> the same name instead of the one that exists right now, and I don't
> think it's too much of a stretch of the imagination to suppose that
> this won't be the last problem we have if we don't fix it.

Arguably it's not really the FD layer that's the issue - it's on the smgr
level. But that's semantics...

> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I think it'd make sense for Thomas to commit the current patch for the
tablespace issues first. And then we can go from there?

> > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.
>
> What about movedb()?

My first reaction was that it should be fine, because RelFileNode.spcNode
would differ. But of course that doesn't protect against moving a database
back and forth. So yes, you're right, it's needed there too.

Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE
should be fine, that allocates a new relfilenode.

> I definitely think it would be nice to cover this issue not only for
> database OIDs and tablespace OIDs but also for relfilenode OIDs.
> Otherwise we're right on the cusp of having the same bug in that case.
> pg_upgrade doesn't drop and recreate tables the way it does databases,
> but if somebody made it do that in the future, would they think about
> this? I bet not.

And even just plugging the wraparound aspect would be great. It's not actually
*that* hard to wrap oids around, because of toast.

> Dilip's been working on your idea of making relfilenode into a 56-bit
> quantity. That would make the problem of detecting wraparound go away.
> In the meantime, I suppose we could do think of trying to do something
> here:
>
> /* wraparound, or first post-initdb
> assignment, in normal mode */
> ShmemVariableCache->nextOid = FirstNormalObjectId;
> ShmemVariableCache->oidCount = 0;
>
> That's a bit sketch, because this is the OID counter, which isn't only
> used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid. Have the wraparound
behaviour only in the nextRelFileNode case. That'd be a nice improvement,
but not entirely trivial, because we currently use GetNewRelFileNode() for
oid assignment as well (c.f. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait
for a barrier in GetNewRelFileNode() if the assigned oid is wrapped
around. While there would be a chance of "missing" a ->nextOid wraparound,
e.g. because of >2**32 toast oids being allocated, I think that could only
happen if there were no "wrapped around relfilenodes" allocated, so that
should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode
issues. So having a place to emit barriers on oid wraparound might be a
good idea.

One thing I wonder is how to make sure the path would be tested. Perhaps we
could emit the anti-wraparound barriers more frequently when assertions are
enabled?

> I'm not sure whether there would be problems waiting for a barrier here - I
> think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what
I can see the two callers of GetNewObjectId() both have loops that can go on
for a long time after a wraparound.

> > I think we need to add some debugging code to detect "fd to deleted file of
> > same name" style problems. Especially during regression tests they're quite
> > hard to notice, because most of the contents read/written are identical across
> > recycling.
>
> Maybe. If we just plug the holes here, I am not sure we need permanent
> instrumentation. But I'm also not violently opposed to it.

The question is how we can find all the places we need to add barriers emit &
wait to. I e.g. didn't initially didn't realize that there's wraparound danger
in WAL replay as well.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-02-10 22:26:59 Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Previous Message Thomas Munro 2022-02-10 22:12:49 Re: Windows now has fdatasync()