Quick Links

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject:	Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date:	2022-02-22 09:11:21
Message-ID:	20220222091121.fod2abymd7qhwti7@alap3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
>
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=10000

these are just to give more time / to require smaller tables.

Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│ 1 │ 2117 │
│ 2 │ 67 │
└───────┴───────┘
(2 rows)

oops.

I don't yet have a good idea how to tackle this in the backbranches.

I've started to work on a few debugging aids to find problem like
these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
deleted files. That seems to work (almost) everywhere. Optionally, on
linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
/proc/$pid/fd/ to check for deleted files. This only can be done after
we've just made sure there's no fds open to deleted files.

Greetings,

Andres Freund

In response to

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: at 2022-02-10 22:26:59 from Andres Freund

Responses

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: at 2022-02-22 09:40:01 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Juan José Santamaría Flecha	2022-02-22 09:20:23	Re: pg_receivewal.exe unhandled exception in zlib1.dll
Previous Message	Daniel Gustafsson	2022-02-22 08:59:09	Re: Patch a potential memory leak in describeOneTableDetails()