Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-08 12:11:36
Message-ID: 20210608121136.GD16435@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 08, 2021 at 01:54:41PM +0200, Matthias van de Meent wrote:
> On Tue, 8 Jun 2021 at 13:03, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> >
> > On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > > I'll leave the instance running for a little bit before restarting (or kill-9)
> > > > in case someone requests more info.
> > >
> > > How about dumping the page image out, and sharing it with the list?
> > > This procedure should work fine from gdb:
> > >
> > > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
> >
> > > I suggest that you dump the "page" pointer inside lazy_scan_prune(). I
> > > imagine that you have the instance already stuck in an infinite loop,
> > > so what we'll probably see from the page image is the page after the
> > > first prune and another no-progress prune.
> >
> > The cluster was again rejecting with "too many clients already".
> >
> > I was able to open a shell this time, but it immediately froze when I tried to
> > tab complete "pg_stat_acti"...
> >
> > I was able to dump the page image, though - attached. I can send you its
> > "data" privately, if desirable. I'll also try to step through this.
>
> Could you attach a dump of lazy_scan_prune's vacrel, all the global
> visibility states (GlobalVisCatalogRels, and possibly
> GlobalVisSharedRels, GlobalVisDataRels, and GlobalVisTempRels), and
> heap_page_prune's PruneState?

(gdb) p *vacrel
$56 = {rel = 0x7fe636faed28, indrels = 0x12b4440, nindexes = 1, do_index_vacuuming = true, do_index_cleanup = true, do_failsafe = false, bstrategy = 0x1210340, lps = 0x0, old_rel_pages = 81, old_live_tuples = 1100, relfrozenxid = 921613998, relminmxid = 53878631, OldestXmin = 926025113, FreezeLimit = 876025113, MultiXactCutoff = 49085856, relnamespace = 0x12b4460 "pg_catalog", relname = 0x12b4488 "pg_statistic", indname = 0x0, blkno = 1, offnum = 6, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, dead_tuples = 0x127a980, rel_pages = 81, scanned_pages = 2, pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 2, pages_removed = 0, lpdead_item_pages = 1, nonempty_pages = 1, lock_waiter_detected = false, new_rel_tuples = 0, new_live_tuples = 0, indstats = 0x12b4568, num_index_scans = 0, tuples_deleted = 0, lpdead_items = 3, new_dead_tuples = 0, num_tuples = 14, live_tuples = 14}

(gdb) p GlobalVisCatalogRels
$57 = {definitely_needed = {value = 926025113}, maybe_needed = {value = 926025112}}
(gdb) p GlobalVisSharedRels
$58 = {definitely_needed = {value = 926025113}, maybe_needed = {value = 926025112}}
(gdb) p GlobalVisDataRels
$59 = {definitely_needed = {value = 926025113}, maybe_needed = {value = 926025113}}
(gdb) p GlobalVisTempRels
$60 = {definitely_needed = {value = 926025113}, maybe_needed = {value = 926025113}}

I don't know when you want prstate from, but here it is at some point:

(gdb) p *prstate
$77 = {rel = 0x7fe636faed28, vistest = 0xe7bcc0 <GlobalVisCatalogRels>, old_snap_ts = 0, old_snap_xmin = 0, old_snap_used = false, new_prune_xid = 0, latestRemovedXid = 0, nredirected = 0, ndead = 0, nunused = 0,

> Additionally, the locals of lazy_scan_prune (more specifically, the
> 'offnum' when it enters heap_page_prune) would also be appreciated, as
> it helps indicate the tuple.

Breakpoint 1, heap_page_prune (relation=relation(at)entry=0x7fe636faed28, buffer=buffer(at)entry=411, vistest=vistest(at)entry=0xe7bcc0 <GlobalVisCatalogRels>, old_snap_xmin=old_snap_xmin(at)entry=0, old_snap_ts=old_snap_ts(at)entry=0,
report_stats=report_stats(at)entry=false, off_loc=off_loc(at)entry=0x12b433c) at pruneheap.c:225
225 pruneheap.c: No such file or directory.
(gdb) up
#1 0x00000000004fd9bf in lazy_scan_prune (vacrel=vacrel(at)entry=0x12b42d0, buf=buf(at)entry=411, blkno=blkno(at)entry=1, page=page(at)entry=0x2aaaab54be00 "J\f", vistest=vistest(at)entry=0xe7bcc0 <GlobalVisCatalogRels>,
prunestate=prunestate(at)entry=0x7fff7e4aae40) at vacuumlazy.c:1712
1712 vacuumlazy.c: No such file or directory.
(gdb) info locals
rel = 0x7fe636faed28
offnum = 6
maxoff = 28
itemid = 0x2aaaab54be2c
tuple = {t_len = 259, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 1}, ip_posid = 6}, t_tableOid = 2619, t_data = 0x2aaaab54db28}
res = <optimized out>
tuples_deleted = 0
lpdead_items = 0
new_dead_tuples = 0
num_tuples = 0
live_tuples = 0
nfrozen = 0

Maybe you need to know that this is also returning RECENTLY_DEAD.

Breakpoint 4, heap_prune_satisfies_vacuum (prstate=prstate(at)entry=0x7fff7e4a9180, tup=tup(at)entry=0x7fff7e4a8f10, buffer=buffer(at)entry=411) at pruneheap.c:423
423 in pruneheap.c
(gdb)
Run till exit from #0 heap_prune_satisfies_vacuum (prstate=prstate(at)entry=0x7fff7e4a9180, tup=tup(at)entry=0x7fff7e4a8f10, buffer=buffer(at)entry=411) at pruneheap.c:423
0x00000000004fa887 in heap_prune_chain (prstate=0x7fff7e4a9180, rootoffnum=6, buffer=411) at pruneheap.c:560
560 in pruneheap.c
Value returned is $72 = HEAPTUPLE_RECENTLY_DEAD

tup = {t_len = 259, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 1}, ip_posid = 6}, t_tableOid = 2619, t_data = 0x2aaaab54db28}
(gdb) p * htup
$82 = {t_choice = {t_heap = {t_xmin = 926014884, t_xmax = 926025112, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 926014884, datum_typmod = 926025112, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 0, bi_lo = 1}, ip_posid = 1}, t_infomask2 = 49183, t_infomask = 9475, t_hoff = 32 ' ', t_bits = 0x2aaaab54db3f "\377\377\177\004"}

--
Justin
System Administrator
Telsasoft
+1-952-707-8581

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Abbas Butt 2021-06-08 12:21:56 Re: Logical replication keepalive flood
Previous Message Matthias van de Meent 2021-06-08 11:54:41 Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic