Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-08 11:54:41
Message-ID: CAEze2Wi6WrXo_PajFmwfved1AsU1mdXdA=+NsBqZ5E3sXszX1w@mail.gmail.com
Lists: pgsql-hackers

On Tue, 8 Jun 2021 at 13:03, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>
> On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > I'll leave the instance running for a little bit before restarting (or kill-9)
> > > in case someone requests more info.
> >
> > How about dumping the page image out, and sharing it with the list?
> > This procedure should work fine from gdb:
> >
> > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
> >
> > I suggest that you dump the "page" pointer inside lazy_scan_prune(). I
> > imagine that you have the instance already stuck in an infinite loop,
> > so what we'll probably see from the page image is the page after the
> > first prune and another no-progress prune.
>
> The cluster was again rejecting with "too many clients already".
>
> I was able to open a shell this time, but it immediately froze when I tried to
> tab complete "pg_stat_acti"...
>
> I was able to dump the page image, though - attached. I can send you its
> "data" privately, if desirable. I'll also try to step through this.

Could you attach a dump of lazy_scan_prune's vacrel, all the global
visibility states (GlobalVisCatalogRels, and possibly
GlobalVisSharedRels, GlobalVisDataRels, and GlobalVisTempRels), and
heap_page_prune's PruneState?

Additionally, the locals of lazy_scan_prune (more specifically, the
'offnum' when it enters heap_page_prune) would also be appreciated, as
they help identify the tuple involved.

I've been looking into what might have caused this, and I'm currently
stuck for lack of information about GlobalVisCatalogRels and the
PruneState.

One curiosity that I did notice is that the t_xmax of the problematic
tuples has been exactly one lower than the OldestXmin. Not weird, but
a curiosity.

With regards,

Matthias van de Meent.

PS. Attached are a few of my current research notes, which are mainly
comparisons between heap_prune_satisfies_vacuum and
HeapTupleSatisfiesVacuum.

Attachment: research_notes.txt (text/plain, 1.1 KB)
