Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-08 21:38:37
Message-ID: CAH2-Wz=4yg7PBaqmxJjhxEJYPNz7VZC3_NDJ7_RHcnicmX+B7A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 8, 2021 at 2:23 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> I'm not sure what you're suggesting ? Maybe I should add some NOTICES there.

Here is one approach that might work: Can you check if the assertion
added by the attached patch fails very quickly with your test case?

This does nothing more than trigger an assertion failure in the event
of retrying a second time for any given heap page. Theoretically that
could happen without there being any bug -- in principle we might have
to retry several times for the same page. In practice the chances of
it happening even once are vanishingly low, though -- so two times
strongly signals a bug. It was quite hard to hit the "goto restart"
even once during my testing. There is still no test coverage for the
line of code because it's so hard to hit.
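
To make the idea concrete, here is a minimal standalone sketch of a retry counter guarding a "goto restart" loop. It is not the attached patch: the function scan_page_counting_retries(), the callback type, and the -1 return value are all hypothetical, and where a real debugging patch would presumably raise an assertion failure this sketch just returns an error code so that it can be exercised outside the server.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical callback: returns true when the page scan, on the given
 * attempt number (0 = first pass), finds a tuple that pruning should
 * already have removed, meaning the page has to be processed again.
 */
typedef bool (*retry_check_fn) (int attempt);

/*
 * Model of a per-page loop with a "goto restart".  Returns the number of
 * retries that were needed, or -1 once the retry count exceeds
 * max_allowed (the point where a debugging build would assert).
 */
static int
scan_page_counting_retries(retry_check_fn needs_retry, int max_allowed)
{
    int         nretries = 0;

restart:
    /* ... prune the page, then examine each remaining tuple ... */
    if (needs_retry(nretries))
    {
        nretries++;
        if (nretries > max_allowed)
            return -1;          /* retried too often: treat as a bug signal */
        goto restart;
    }
    return nretries;
}

/* Example callbacks (all hypothetical). */
static bool never_retry(int attempt)  { (void) attempt; return false; }
static bool retry_once(int attempt)   { return attempt == 0; }
static bool always_retry(int attempt) { (void) attempt; return true; }
```

With max_allowed set to 1, a page that never retries or retries once passes, while a page that keeps retrying is flagged on its second retry instead of looping forever.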

If you find that the assertion is hit pretty quickly with the same
workload then you've all but reproduced the issue, probably in far
less time. And, if you know that there were no concurrently aborting
transactions then you can be 100% sure that you have reproduced the
issue -- this goto is only supposed to be executed when a transaction
that was in progress during the heap_page_prune() aborts after it
returns, but before we call HeapTupleSatisfiesVacuum() for one of the
aborted-xact tuples. It's supposed to be a super narrow thing.
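
The narrow window described above can be modeled with a toy sketch. The enum names and helper functions below are hypothetical simplifications, not PostgreSQL's actual types or the real return values of HeapTupleSatisfiesVacuum(); the sketch only shows that a retry is legitimate in exactly one combination of states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified state of the transaction that inserted a tuple. */
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } XactState;

/* Hypothetical, simplified verdict about the tuple. */
typedef enum { TUPLE_LIVE, TUPLE_INSERT_IN_PROGRESS, TUPLE_DEAD } TupleVerdict;

/* Classify a tuple purely from its inserter's state (toy model). */
static TupleVerdict
classify(XactState inserter)
{
    switch (inserter)
    {
        case XACT_ABORTED:
            return TUPLE_DEAD;
        case XACT_IN_PROGRESS:
            return TUPLE_INSERT_IN_PROGRESS;
        default:
            return TUPLE_LIVE;
    }
}

/*
 * A retry is expected only when pruning left the tuple in place (its
 * inserter had not aborted yet) and the later recheck finds it dead
 * (the inserter aborted in between).
 */
static bool
retry_needed(XactState at_prune, XactState at_recheck)
{
    if (classify(at_prune) == TUPLE_DEAD)
        return false;           /* pruning already removed the tuple */
    return classify(at_recheck) == TUPLE_DEAD;
}
```

In this model, retry_needed(XACT_IN_PROGRESS, XACT_ABORTED) is the only in-progress case that returns true, which is why a workload with no concurrently aborting transactions should never reach the goto.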

> I'm not sure why/if pg_statistic is special, but I guess when analyze happens,
> it gets updated, and eventually processed by autovacuum.

pg_statistic is probably special, though only in a superficial way: it
is the system catalog that tends to be the most frequently vacuumed in
practice.

> In pg14, the parent table is auto-analyzed.

I wouldn't expect that to matter. The "ANALYZE portion" of the VACUUM
ANALYZE won't have started at the point that we get stuck.

--
Peter Geoghegan

Attachment: 0001-Assert-that-restart-behavior-happens-once-only.patch (application/octet-stream, 1.0 KB)
