Re: Backpatch b61d161c14

From: Andres Freund <andres(at)anarazel(dot)de>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Subject: Re: Backpatch b61d161c14
Date: 2020-06-22 20:09:39
Message-ID: 20200622200939.jkuniq3vtiumeszj@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-06-22 10:35:47 +0530, Amit Kapila wrote:
> I propose to backpatch b61d161c14 [1] (Introduce vacuum errcontext to
> display additional information.). In the recent past, we have seen an
> error report similar to "ERROR: found xmin 2157740646 from before
> relfrozenxid 1197" from multiple EDB customers. A similar report is
> seen on pgsql-bugs as well [2] which I think has triggered the
> implementation of this feature for v13. Such reports mostly indicate
> database corruption rather than any postgres bug which is also
> indicated by the error-code (from before relfrozenxid) for this
> message. I think there is a good reason to back-patch this as multiple
> users are facing similar issues. This patch won't fix this issue but
> it will help us in detecting the problematic part of the heap/index
> and then if users wish they can delete the portion of data that
> appeared to be corrupted and resume the operations on that relation.
>
> I have tried to back-patch this for v12 and attached is the result.
> The attached patch passes make check-world but I have yet to test it
> manually and also prepare the patch for other branches once we agree
> on this proposal.

I think having the additional information in the back branches would be
good. But on the other hand I think this is a somewhat large change
to backpatch, and it hasn't yet much real world exposure.

I'm also uncomfortable with the approach of just copying all of
LVRelStats in several places:

> /*
> @@ -1580,9 +1648,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
> int uncnt = 0;
> TransactionId visibility_cutoff_xid;
> bool all_frozen;
> + LVRelStats olderrinfo;
>
> pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
>
> + /* Update error traceback information */
> + olderrinfo = *vacrelstats;
> + update_vacuum_error_info(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
> + blkno, NULL);
> +
> START_CRIT_SECTION();
>
> for (; tupindex < vacrelstats->num_dead_tuples; tupindex++)
> @@ -1659,6 +1733,11 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
> *vmbuffer, visibility_cutoff_xid, flags);
> }
>
> + /* Revert to the previous phase information for error traceback */
> + update_vacuum_error_info(vacrelstats,
> + olderrinfo.phase,
> + olderrinfo.blkno,
> + olderrinfo.indname);
> return tupindex;
> }

To me that's a very weird approach. It's fragile because we need to be
sure that there's no updates to the wrong LVRelStats for important
things, and it has a good bit of potential to be inefficient because
LVRelStats isn't exactly small. This pretty much relies on the compiler
doing good enough escape analysis to realize that most parts of
olderrinfo aren't touched.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2020-06-22 20:23:58 Re: Default setting for enable_hashagg_disk
Previous Message Tom Lane 2020-06-22 20:01:22 Re: More tzdb fun: POSIXRULES is being deprecated upstream