Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date: 2022-03-03 18:47:20
Message-ID: CA+TgmoZu27SnS-GASPmXXUXbTcD0yqxbo7OJ3n_Q1NPgfmRbSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 3, 2022 at 1:28 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I can't remember that verify() is the one that accesses conflict.db large
> > while cause_eviction() is the one that accesses postgres.replace_sb for more
> > than like 15 seconds.
>
> For more than 15seconds? The whole test runs in a few seconds for me...

I'm not talking about running the test. I'm talking about reading it
and trying to keep straight what is happening. The details of which
function is accessing which database keep going out of my head.
Quickly. Maybe because I'm dumb, but I think some better naming could
help, just in case other people are dumb, too.

> > As to (5), the identifiers are just primary and standby, without
> > mentioning the database name, which adds to the confusion, and there
> > are no comments explaining why we have two nodes.
>
> I don't follow this one - physical rep doesn't do anything below the database
> level?

Right but ... those handles are connected to a particular DB on that
node, not just the node in general.

> > I think it would be helpful to find a way to divide the test case up
> > into sections that are separated from each other visually, and explain
> > the purpose of each test a little more in a comment. For example I'm
> > struggling to understand the exact purpose of this bit:
> >
> > +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> > +
> > +verify($node_primary, $node_standby, 3,
> > + "restored contents as expected");
> >
> > I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> > me what this does that is related to FD reuse.
>
> It's just trying to clean up prior damage, so the test continues to later
> ones. Definitely not needed once there's a fix. But it's useful while trying
> to make the test actually detect various corruptions.

Ah, OK, I definitely did not understand that before.

> > Similarly for the later ones. I generally grasp that you are trying to
> > make sure that things are OK with respect to FD reuse in various
> > scenarios, but I couldn't explain to you what we'd be losing in terms
> > of test coverage if we removed this line:
> >
> > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
> >
> > I am not sure how much all of this potential cleanup matters because
> > I'm pretty skeptical that this would be stable in the buildfarm.
>
> Which "potential cleanup" are you referring to?

I mean, whatever improvements you might consider making based on my comments.

> > It relies on a lot of assumptions about timing and row sizes and details of
> > when invalidations are sent that feel like they might not be entirely
> > predictable at the level you would need to have this consistently pass
> > everywhere. Perhaps I am too pessimistic?
>
> I don't think the test passing should be dependent on row size, invalidations
> etc to a significant degree (unless the tables are too small to reach s_b
> etc)? The exact symptoms of failures are highly unstable, but obviously we'd
> fix them in-tree before committing a test. But maybe I'm missing a dependency?
>
> FWIW, the test did pass on freebsd, linux, macos and windows with the
> fix. After a few iterations of improving the fix ;)

Hmm. Well, I admit that's already more than I would have expected....

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2022-03-03 19:00:14 Re: casting operand to proper type in BlockIdGetBlockNumber
Previous Message Andres Freund 2022-03-03 18:28:29 Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: