Re: recovering from "found xmin ... from before relfrozenxid ..."

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com>
Cc: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, MBeena Emerson <mbeena(dot)emerson(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: recovering from "found xmin ... from before relfrozenxid ..."
Date: 2020-07-31 18:47:51
Message-ID: CA+TgmoZLpeFNg5ypkXfKdvKktZKSwxprnB3TinDUDz12fdKScw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
> Attached is the new version of patch that addresses the comments from Andrey and Beena.

+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"

the -> a

I also suggest: heap table -> relation, because we might want to
extend this to other cases later.

+-- toast table.
+begin;
+create table ttab(a text);
+insert into ttab select string_agg(chr(floor(random() * 26)::int +
65), '') from generate_series(1,10000);
+select * from ttab where xmin = 2;
+ a
+---
+(0 rows)
+
+select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
+ heap_force_freeze
+-------------------
+
+(1 row)
+

I don't understand the point of this. You're not testing the function
on the TOAST table; you're testing it on the main table when there
happens to be a TOAST table that is probably getting used for
something. But that's not really relevant to what is being tested
here, so as written this seems redundant with the previous cases.

+-- test pg_surgery functions with the unsupported relations. Should fail.

Please name the specific functions being tested here in case we add
more in the future that are tested separately.

+++ b/contrib/pg_surgery/heap_surgery_funcs.c

I think we could drop _funcs from the file name.

+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

The #ifdef here is not required, and if you look at other contrib
modules you'll see that they don't have it.

I don't like all the macros at the top of the file much. CHECKARRVALID
is only used in one place, so it seems to me that you might as well
just inline it and lose the macro. Likewise for SORT and ARRISEMPTY.

Once you do that, heap_force_common() can just compute the number of
array elements once, instead of doing it once inside ARRISEMPTY, then
again inside SORT, and then a third time to initialize ntids. You can
just have a local variable in that function that is set once and then
used as needed. Then you are only doing ARRNELEMS once, so you can get
rid of that macro too. The same technique can be used to get rid of
ARRPTR. So then all the macros go away without introducing any code
duplication.

+/* Options to forcefully change the state of a heap tuple. */
+typedef enum HTupleForceOption
+{
+ FORCE_KILL,
+ FORCE_FREEZE
+} HTupleForceOption;

I suggest un-abbreviating HTuple to HeapTuple and un-abbreviating the
enum members to HEAP_FORCE_KILL and HEAP_FORCE_FREE. Also, how about
option -> operation?

+ return heap_force_common(fcinfo, FORCE_KILL);

I think it might be more idiomatic to use PG_RETURN_DATUM here. I
know it's the same thing, though, and perhaps I'm even wrong about the
prevailing style.

+ Assert(force_opt == FORCE_KILL || force_opt == FORCE_FREEZE);

I think this is unnecessary. It's an enum with 2 values.

+ if (ARRISEMPTY(ta))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("empty tid array")));

I don't see why this should be an error. Why can't we just continue
normally and if it does nothing, it does nothing? You'd need to change
the do..while loop to a while loop so that the end condition is tested
at the top, but that seems fine.

+ rel = relation_open(relid, AccessShareLock);

Maybe we should take RowExclusiveLock, since we are going to modify
rows. Not sure how much it matters, though.

+ if (!superuser() && GetUserId() != rel->rd_rel->relowner)
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser or object owner to use %s.",
+ force_opt == FORCE_KILL ? "heap_force_kill()" :
+ "heap_force_freeze()")));

This is the wrong way to do a permissions check, and it's also the
wrong way to write an error message about having failed one. To see
the correct method, grep for pg_class_aclcheck. However, I think that
we shouldn't in general trust the object owner to do this, unless the
super-user gave permission. This is a data-corrupting operation, and
only the boss is allowed to authorize it. So I think you should also
add REVOKE EXECUTE FROM PUBLIC statements to the SQL file, and then
have this check as a backup. Then, the superuser is always allowed,
and if they choose to GRANT EXECUTE on this function to some users,
those users can do it for their own relations, but not others.

+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only heap AM is supported")));
+
+ check_relation_relkind(rel);

Seems like these checks are in the wrong order. Also, maybe you could
call the function something like check_relation_ok() and put the
permissions test, the relkind test, and the relam test all inside of
it, just to tighten up the code in this main function a bit.

+ if (noffs > maxoffset)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("number of offsets specified for block %u exceeds the max
offset number %u",
+ blkno, maxoffset)));

Hmm, this doesn't seem quite right. The actual problem is if an
individual item pointer's offset number is greater than maxoffset,
which can be true even if the total number of offsets is less than
maxoffset. So I think you need to remove this check and add a check
inside the loop which follows that offnos[i] is in range.

The way you've structured that loop is actually problematic -- I don't
think we want to be calling elog() or ereport() inside a critical
section. You could fix the case that checks for an invalid force_opt
by just doing if (op == HEAP_FORCE_KILL) { ... } else { Assert(op ==
HEAP_FORCE_FREEZE); ... }, or by using a switch with no default. The
NOTICE case you have here is a bigger problem. You really cannot
modify the buffer like this and then decide, oops, never mind, I think
I won't mark it dirty or write WAL for the changes. If you do that,
the buffer is still in memory, but it's now been modified. A
subsequent operation that modifies it will start with the altered
state you created here, quite possibly leading to WAL that cannot be
correctly replayed on the standby. In other words, you've got to
decide for certain whether you want to proceed with the operation
*before* you enter the critical section. You also need to emit any
messages before or after the critical section. So you could:

1. If you encounter a TID that's unused or dead, skip it silently.
-or-
2. Loop over offsets twice. The first time, ERROR if you find any one
that is unused or dead. Then start a critical section. Loop again and
do the real work.
-or-
3. Like #2, but emit a NOTICE about a unused or dead item rather than
an ERROR, and skip the critical section and the second loop if you did
that >0 times.
-or-
4. Like #3, but don't skip anything just because you emitted a NOTICE
about the page.

#3 is closest to the behavior you have now, but I'm not sure what else
it has going for it. It doesn't seem like particularly intuitive
behavior that finding a dead or unused TID should cause other item
TIDs on the same page not to get processed while still permitting TIDs
on other pages to get processed. I don't think that's the behavior
users will be expecting. I think my vote is for #4, which will emit a
NOTICE about any TID that is dead or unused -- and I guess also about
any TID whose offset number is out of range -- but won't actually skip
any operations that can be performed. But there are decent arguments
for #1 or #2 too.

+ (errmsg("skipping tid (%u, %u) because it is already marked %s",
+ blkno, offnos[i],
+ ItemIdIsDead(itemid) ? "dead" : "unused")));

I believe this violates our guidelines on message construction. Have
two completely separate messages -- and maybe lose the word "already":

"skipping tid (%u, %u) because it is dead"
"skipping tid (%u, %u) because it is unused"

The point of this is that it makes it easier for translators.

I see very little point in what verify_tid() is doing. Before using
each block number, we should check that it's less than or equal to a
cached value of RelationGetNumberOfBlocks(rel). That's necessary in
any case to avoid funny errors; and then the check here against
specifically InvalidBlockNumber is redundant. For the offset number,
same thing: we need to check each offset against the page's
PageGetMaxOffsetNumber(page); and if we do that then we don't need
these checks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message James Coleman 2020-07-31 18:51:09 Re: [DOC] Document concurrent index builds waiting on each other
Previous Message Robert Haas 2020-07-31 17:50:49 Re: refactoring basebackup.c