Re: recovering from "found xmin ... from before relfrozenxid ..."

From: Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, MBeena Emerson <mbeena(dot)emerson(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Date: 2020-08-06 08:11:27
Message-ID: CA+fd4k5jp5aGH6GGvcyO41aSXVo4AVPZo2q9FDx8c7BDWU1J_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi Robert,
>
> Thanks for the review. Please find my comments inline:
>
> On Sat, Aug 1, 2020 at 12:18 AM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >
> > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> > > Attached is the new version of patch that addresses the comments from Andrey and Beena.
> >
> > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
> >
> > the -> a
> >
> > I also suggest: heap table -> relation, because we might want to
> > extend this to other cases later.
> >
>
> Corrected.
>
> > +-- toast table.
> > +begin;
> > +create table ttab(a text);
> > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
> > 65), '') from generate_series(1,10000);
> > +select * from ttab where xmin = 2;
> > + a
> > +---
> > +(0 rows)
> > +
> > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
> > + heap_force_freeze
> > +-------------------
> > +
> > +(1 row)
> > +
> >
> > I don't understand the point of this. You're not testing the function
> > on the TOAST table; you're testing it on the main table when there
> > happens to be a TOAST table that is probably getting used for
> > something. But that's not really relevant to what is being tested
> > here, so as written this seems redundant with the previous cases.
> >
>
> Yeah, it's being tested on the main table, not on a toast table. I've
> removed this test-case and also restricted direct access to the toast
> table using heap_force_kill/freeze functions. I think we shouldn't be
> using these functions to do any changes in the toast table. We will
> only use these functions with the main table and let VACUUM remove the
> corresponding data chunks (pointed by the tuple that got removed from
> the main table).
>
> Another option would be to identify all the data chunks corresponding
> to the tuple (ctid) being killed from the main table and remove them
> one by one. We will only do this if the tuple from the main table that
> has been marked as killed has an external storage. We will have to add
> a bunch of code for this otherwise we can let VACUUM do this for us.
> Let me know your thoughts on this.
>
> > +-- test pg_surgery functions with the unsupported relations. Should fail.
> >
> > Please name the specific functions being tested here in case we add
> > more in the future that are tested separately.
> >
>
> Done.
>
> > +++ b/contrib/pg_surgery/heap_surgery_funcs.c
> >
> > I think we could drop _funcs from the file name.
> >
>
> Done.
>
> > +#ifdef PG_MODULE_MAGIC
> > +PG_MODULE_MAGIC;
> > +#endif
> >
> > The #ifdef here is not required, and if you look at other contrib
> > modules you'll see that they don't have it.
> >
>
> Okay, done.
>
> > I don't like all the macros at the top of the file much. CHECKARRVALID
> > is only used in one place, so it seems to me that you might as well
> > just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.
> >
>
> Done.
>
> > Once you do that, heap_force_common() can just compute the number of
> > array elements once, instead of doing it once inside ARRISEMPTY, then
> > again inside SORT, and then a third time to initialize ntids. You can
> > just have a local variable in that function that is set once and then
> > used as needed. Then you are only doing ARRNELEMS once, so you can get
> > rid of that macro too. The same technique can be used to get rid of
> > ARRPTR. So then all the macros go away without introducing any code
> > duplication.
> >
>
> Done.
>
> > +/* Options to forcefully change the state of a heap tuple. */
> > +typedef enum HTupleForceOption
> > +{
> > + FORCE_KILL,
> > + FORCE_FREEZE
> > +} HTupleForceOption;
> >
> > I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
> > enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE.
>
> Done.
>
> Also, how about
> > option -> operation?
> >
>
> I think both look okay to me.
>
> > + return heap_force_common(fcinfo, FORCE_KILL);
> >
> > I think it might be more idiomatic to use PG_RETURN_DATUM here. I
> > know it's the same thing, though, and perhaps I'm even wrong about the
> > prevailing style.
> >
>
> Done.
>
> > + Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);
> >
> > I think this is unnecessary. It's an enum with 2 values.
> >
>
> Removed.
>
> > + if (ARRISEMPTY(ta))
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("empty tid array")));
> >
> > I don't see why this should be an error. Why can't we just continue
> > normally and if it does nothing, it does nothing? You'd need to change
> > the do..while loop to a while loop so that the end condition is tested
> > at the top, but that seems fine.
> >
>
> I think it's okay to have this check. So, just left it as-is. We do
> have such checks in other contrib modules as well wherever the array
> is being passed as an input to the function.
>
> > + rel = relation_open(relid, AccessShareLock);
> >
> > Maybe we should take RowExclusiveLock, since we are going to modify
> > rows. Not sure how much it matters, though.
> >
>
> I don't know how it would make a difference, but still as you said
> replaced AccessShare with RowExclusive.
>
> > + if (!superuser() && GetUserId() != rel->rd_rel->relowner)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> > + errmsg("must be superuser or object owner to use %s.",
> > + force_opt == FORCE_KILL ? "heap_force_kill()" :
> > + "heap_force_freeze()")));
> >
> > This is the wrong way to do a permissions check, and it's also the
> > wrong way to write an error message about having failed one. To see
> > the correct method, grep for pg_class_aclcheck. However, I think that
> > we shouldn't in general trust the object owner to do this, unless the
> > super-user gave permission. This is a data-corrupting operation, and
> > only the boss is allowed to authorize it. So I think you should also
> > add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
> > have this check as a backup. Then, the superuser is always allowed,
> > and if they choose to GRANT EXECUTE on this function to some users,
> > those users can do it for their own relations, but not others.
> >
>
> Done.
>
> > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > + errmsg("only heap AM is supported")));
> > +
> > + check_relation_relkind(rel);
> >
> > Seems like these checks are in the wrong order.
>
> I don't think there is anything wrong with the order. I can see the
> same order in other contrib modules as well.
>
> Also, maybe you could
> > call the function something like check_relation_ok() and put the
> > permissions test, the relkind test, and the relam test all inside of
> > it, just to tighten up the code in this main function a bit.
> >
>
> Yeah, I've added a couple of functions named sanity_check_relation and
> sanity_check_tid_array and shifted all the sanity checks there.
>
> > + if (noffs > maxoffset)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > + errmsg("number of offsets specified for block %u exceeds the max
> > offset number %u",
> > + blkno, maxoffset)));
> >
> > Hmm, this doesn't seem quite right. The actual problem is if an
> > individual item pointer's offset number is greater than maxoffset,
> > which can be true even if the total number of offsets is less than
> > maxoffset. So I think you need to remove this check and add a check
> > inside the loop which follows that offnos[i] is in range.
> >
>
> Agreed and done.
>
> > The way you've structured that loop is actually problematic -- I don't
> > think we want to be calling elog() or ereport() inside a critical
> > section. You could fix the case that checks for an invalid force_opt
> > by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
> > HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
> > NOTICE case you have here is a bigger problem.
>
> Done.
>
> You really cannot
> > modify the buffer like this and then decide, oops, never mind, I think
> > I won't mark it dirty or write WAL for the changes. If you do that,
> > the buffer is still in memory, but it's now been modified. A
> > subsequent operation that modifies it will start with the altered
> > state you created here, quite possibly leading to WAL that cannot be
> > correctly replayed on the standby. In other words, you've got to
> > decide for certain whether you want to proceed with the operation
> > *before* you enter the critical section. You also need to emit any
> > messages before or after the critical section. So you could:
> >
>
> This is still not clear. I think Robert needs to respond to my earlier comment.
>
> > I believe this violates our guidelines on message construction. Have
> > two completely separate messages -- and maybe lose the word "already":
> >
> > "skipping tid (%u, %u) because it is dead"
> > "skipping tid (%u, %u) because it is unused"
> >
> > The point of this is that it makes it easier for translators.
> >
>
> Done.
>
> > I see very little point in what verify_tid() is doing. Before using
> > each block number, we should check that it's less than or equal to a
> > cached value of RelationGetNumberOfBlocks(rel). That's necessary in
> > any case to avoid funny errors; and then the check here against
> > specifically InvalidBlockNumber is redundant. For the offset number,
> > same thing: we need to check each offset against the page's
> > PageGetMaxOffsetNumber(page); and if we do that then we don't need
> > these checks.
> >
>
> Done.
>
> Please check the attached patch for the changes.

I also looked at this version patch and have some small comments:

+ Oid relid = PG_GETARG_OID(0);
+ ArrayType *ta = PG_GETARG_ARRAYTYPE_P_COPY(1);
+ ItemPointer tids;
+ int ntids;
+ Relation rel;
+ Buffer buf;
+ Page page;
+ ItemId itemid;
+ BlockNumber blkno;
+ OffsetNumber *offnos;
+ OffsetNumber offno,
+ noffs,
+ curr_start_ptr,
+ next_start_ptr,
+ maxoffset;
+ int i,
+ nskippedItems,
+ nblocks;

You declare all variables at the top of heap_force_common() function
but I think we can declare some variables such as buf, page inside of
the do loop.

---
+ if (offnos[i] > maxoffset)
+ {
+ ereport(NOTICE,
+ errmsg("skipping tid (%u, %u) because it
contains an invalid offset",
+ blkno, offnos[i]));
+ continue;
+ }

If all tids on a page take the above path, we will end up logging FPI
in spite of modifying nothing on the page.

---
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ log_newpage_buffer(buf, true);

I think we need to set the returned LSN by log_newpage_buffer() to the page lsn.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2020-08-06 08:17:44 Re: PROC_IN_ANALYZE stillborn 13 years ago
Previous Message kato-sho@fujitsu.com 2020-08-06 07:25:00 RE: Creating foreign key on partitioned table is too slow