Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic
Date: 2021-06-08 21:23:37
Message-ID: 20210608212337.GO16435@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jun 08, 2021 at 01:52:40PM -0700, Peter Geoghegan wrote:
> On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > > They're running this:
> > > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c "REINDEX INDEX CONCURRENTLY $i"
> > > And if it times out, it then runs: $PSQL "DROP INDEX CONCURRENTLY $bad"
> > ...
> > > $ date -d @1623121264
> > > Mon Jun 7 22:01:04 CDT 2021
>
> Perhaps reindex was waiting on the VACUUM process to finish, while
> VACUUM was (in effect) busy waiting on the REINDEX to finish.

But when the reindex exited, the vacuum kept spinning until I sent SIGABRT 12
hours later.

> The other bug that you reported back in July of last year [1] (which
> involved a "REINDEX INDEX pg_class_tblspc_relfilenode_index") was
> pretty easy to recreate, just by running the REINDEX in a tight loop.
> Could you describe how tricky it is to repro this issue now?

I didn't try to reproduce it, but now hit it twice in 3 days.
(Actuallly, I did try to reproduce it, by running tight loops around
vacuum/analyze pg_statistic, which didn't work. Maybe because reindex is
what's important.)

I mentioned that we've been running pg14b1 since 2021-05-20. So it ran fine for
13 days before breaking in an obvious way.

OH - in the first instance, I recorded the stuck process, but not its
timestamp. It looks like that autovacuum process *also* started right after
10pm, which is when the reindex job starts. So it seems like REINDEX may
trigger this pretty consistently:

(gdb) frame 4
#4 heap_vacuum_rel (rel=0x7f0349466d28, params=0x1c77b7c, bstrategy=<optimized out>) at vacuumlazy.c:612
612 vacuumlazy.c: No such file or directory.
(gdb) info locals
starttime = 676177375524485

$ date -d '2000-01-01 UTC + 676177375seconds'
Fri Jun 4 22:02:55 CDT 2021

> If you instrument the "goto retry" code added to lazy_scan_prune() by
> commit 8523492d, then you might notice that it is hit in contexts that
> it was never intended to work with. If you can reduce reproducing the
> problem to reproducing hitting that goto in the absence of an aborted
> transaction, then it might be a lot easier to produce a simple repro.

I'm not sure what you're suggesting ? Maybe I should add some NOTICES there.

I'm not sure why/if pg_statistic is special, but I guess when analyze happens,
it gets updated, and eventually processed by autovacuum.

The main table here is a partitioned table which receives UPDATEs which moves
tuples into a different partition (probably more often than what's
recommended).

autovacuum_analyze_threshold | 2
autovacuum_analyze_scale_factor | 0.005
autovacuum_vacuum_scale_factor | 0.005
log_autovacuum_min_duration | 9000
checkpoint_timeout | 60
wal_level | minimal

In pg14, the parent table is auto-analyzed.

--
Justin

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2021-06-08 21:29:25 Re: Make unlogged table resets detectable
Previous Message Peter Geoghegan 2021-06-08 20:52:40 Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic