wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date: 2022-02-09 22:00:04
Message-ID: 20220209220004.kb3dgtn2x2k2gtdm@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I was working on rebasing the AIO branch. Tests started to fail after, but it
turns out that the problem exists independent of AIO.

The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas <rhaas(at)postgresql(dot)org>
Date: 2022-01-24 14:23:15 -0500

pg_upgrade: Preserve database OIDs.

Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve
relfilenodes and tablespace OIDs. For similar reasons, also arrange
to preserve database OIDs.

after this commit the database contents for databases that exist before
pg_upgrade runs, and the databases pg_upgrade (re-)creates can have the same
oid. Consequently the RelFileNode of files in the "pre-existing" database and
the re-created database will be the same.

That's a problem: There is no reliable invalidation mechanism forcing all
processes to clear SMgrRelationHash when a database is dropped /
created. Which means that backends can end up using file descriptors for the
old, dropped, database, when accessing contents in the new database. This is
most likely to happen to bgwriter, but could be any backend (backends in other
databases can end up opening files in another database to write out dirty
victim buffers during buffer replacement).

That's of course dangerous, because we can end up reading the wrong data (by
reading from the old file), or loosing writes (by writing to the old file).

Now, this isn't a problem that's entirely new - but previously it requried
relfilenodes to be recycled due to wraparound or such. The timeframes for that
are long enough that it's very likely that *something* triggers smgr
invalidation processing. E.g. has

if (FirstCallSinceLastCheckpoint())
{
/*
* After any checkpoint, close all smgr files. This is so we
* won't hang onto smgr references to deleted files indefinitely.
*/
smgrcloseall();
}

in its mainloop. Previously for the problem to occur there would have to be
"oid wraparound relfilenode recycling" within a single checkpoint window -
quite unlikely. But now a bgwriter cycle just needs to span a drop/create
database, easier to hit.

I think the most realistic way to address this is the mechanism is to use the
mechanism from
https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

We perhaps could avoid all relfilenode recycling, issues on the fd level, by
adding a barrier whenever relfilenode allocation wraps around. But I'm not
sure how easy that is to detect.

I think we might actually be able to remove the checkpoint from dropdb() by
using barriers?

I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

b) might also work on a few other operating systems, but I have not verified
that. a) has the advantage of providing a filename, which makes it a lot
easier to understand the problem. It'd probably make sense to use b) on all
the operating systems it works on, and then a) on linux for a more useful
error message.

The checks for this would need to be in FileRead/FileWrite/... and direct
calls to pread etc. I think that's an acceptable cost. I have a patch for
this, but it's currently based on the AIO tree. if others agree it'd be useful
to have, I'll adapt it to work on master.

Comments?

Greetings,

Andres Freund

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Aliaksandr Kalenik 2022-02-09 22:19:15 Re: [PATCH] nodeindexscan with reorder memory leak
Previous Message Joshua Brindle 2022-02-09 21:51:18 Re: [PATCH v2] use has_privs_for_role for predefined roles