Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-08 12:27:14
Message-ID: CAEze2WgT63ggfP7KXdxC7d1xnxxWKFoeYs=1DeaGvc+XF=xyEw@mail.gmail.com
Lists: pgsql-hackers

On Tue, 8 Jun 2021 at 14:11, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> On Tue, Jun 08, 2021 at 01:54:41PM +0200, Matthias van de Meent wrote:
> > On Tue, 8 Jun 2021 at 13:03, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > >
> > > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > > > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > > I'll leave the instance running for a little bit before restarting (or kill-9)
> > > > > in case someone requests more info.
> > > >
> > > > How about dumping the page image out, and sharing it with the list?
> > > > This procedure should work fine from gdb:
> > > >
> > > > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
> > >
> > > > I suggest that you dump the "page" pointer inside lazy_scan_prune(). I
> > > > imagine that you have the instance already stuck in an infinite loop,
> > > > so what we'll probably see from the page image is the page after the
> > > > first prune and another no-progress prune.
> > >
> > > The cluster was again rejecting with "too many clients already".
> > >
> > > I was able to open a shell this time, but it immediately froze when I tried to
> > > tab complete "pg_stat_acti"...
> > >
> > > I was able to dump the page image, though - attached. I can send you its
> > > "data" privately, if desirable. I'll also try to step through this.
> >
> > Could you attach a dump of lazy_scan_prune's vacrel, all the global
> > visibility states (GlobalVisCatalogRels, and possibly
> > GlobalVisSharedRels, GlobalVisDataRels, and GlobalVisTempRels), and
> > heap_page_prune's PruneState?
>
> (gdb) p *vacrel
> $56 = {... OldestXmin = 926025113, ...}
>
> (gdb) p GlobalVisCatalogRels
> $57 = {definitely_needed = {value = 926025113}, maybe_needed = {value = 926025112}}

This maybe_needed is older than OldestXmin, which indeed gives us the
problematic behaviour:

heap_prune_satisfies_vacuum considers one more transaction to be
unvacuumable, and thus refuses to remove a tuple that
HeapTupleSatisfiesVacuum (which uses OldestXmin) does consider
removable.
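
To make the disagreement concrete, here is a minimal standalone sketch
(not the actual server code; the helper names are invented and
wraparound handling is omitted, only the xids come from your dump).
Any deleting xid in the range [maybe_needed, OldestXmin) -- here
926025112 -- is DEAD according to lazy_scan_prune's own check but not
removable according to the prune-side check, so the retry loop never
makes progress:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/*
 * Stand-in for the HeapTupleSatisfiesVacuum() cutoff: a deleted tuple
 * is reported HEAPTUPLE_DEAD once its xmax precedes vacrel->OldestXmin.
 */
static bool
dead_per_oldest_xmin(TransactionId xmax, TransactionId oldest_xmin)
{
    return xmax < oldest_xmin;      /* wraparound ignored for brevity */
}

/*
 * Stand-in for the GlobalVisTest check used by
 * heap_prune_satisfies_vacuum(): the tuple only becomes removable once
 * its xmax precedes maybe_needed.
 */
static bool
removable_per_vistest(TransactionId xmax, TransactionId maybe_needed)
{
    return xmax < maybe_needed;     /* wraparound ignored for brevity */
}

int
main(void)
{
    TransactionId oldest_xmin = 926025113;   /* vacrel->OldestXmin */
    TransactionId maybe_needed = 926025112;  /* GlobalVisCatalogRels.maybe_needed */
    TransactionId xmax = 926025112;          /* deleter caught between the horizons */

    printf("HeapTupleSatisfiesVacuum-style check says DEAD:      %d\n",
           dead_per_oldest_xmin(xmax, oldest_xmin));
    printf("heap_prune_satisfies_vacuum-style check removes it:  %d\n",
           removable_per_vistest(xmax, maybe_needed));
    /* prints 1 and 0: pruning leaves the tuple behind, vacuum retries forever */
    return 0;
}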

The open question now is: why is
GlobalVisCatalogRels->maybe_needed < OldestXmin? IIRC
GlobalVisCatalogRels->maybe_needed is constructed from the same
ComputeXidHorizonsResult->catalog_oldest_nonremovable that is later
returned for use as vacrel->OldestXmin.

> Maybe you need to know that this is also returning RECENTLY_DEAD.

I had expected that, but good to have confirmation.

Thanks for the information!

With regards,

Matthias van de Meent.
