Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6
Date: 2019-05-01 04:43:34
Message-ID: 8749.1556685814@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2019-04-30 15:53:07 -0700, Andres Freund wrote:
>> I'll move the test into a new "reindex_catalog" test, with a comment
>> explaining that the failure cases necessitating that are somewhere
>> between bugs, ugly warts, an hard to fix edge cases.

> Just pushed that.

locust is kind of unimpressed:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2019-05-01%2003%3A12%3A13

The relevant bit of log is

2019-05-01 05:24:47.527 CEST [97690:429] pg_regress/create_view LOG: statement: DROP SCHEMA temp_view_test CASCADE;
2019-05-01 05:24:47.605 CEST [97690:430] pg_regress/create_view LOG: statement: DROP SCHEMA testviewschm2 CASCADE;
2019-05-01 05:24:47.858 CEST [97694:1] [unknown] LOG: connection received: host=[local]
2019-05-01 05:24:47.863 CEST [97694:2] [unknown] LOG: connection authorized: user=pgbuildfarm database=regression
2019-05-01 05:24:47.878 CEST [97694:3] pg_regress/reindex_catalog LOG: statement: REINDEX TABLE pg_class;
2019-05-01 05:24:48.887 CEST [97694:4] pg_regress/reindex_catalog ERROR: deadlock detected
2019-05-01 05:24:48.887 CEST [97694:5] pg_regress/reindex_catalog DETAIL: Process 97694 waits for ShareLock on transaction 2559; blocked by process 97690.
Process 97690 waits for RowExclusiveLock on relation 1259 of database 16387; blocked by process 97694.
Process 97694: REINDEX TABLE pg_class;
Process 97690: DROP SCHEMA testviewschm2 CASCADE;
2019-05-01 05:24:48.887 CEST [97694:6] pg_regress/reindex_catalog HINT: See server log for query details.
2019-05-01 05:24:48.887 CEST [97694:7] pg_regress/reindex_catalog CONTEXT: while checking uniqueness of tuple (12,71) in relation "pg_class"
2019-05-01 05:24:48.887 CEST [97694:8] pg_regress/reindex_catalog STATEMENT: REINDEX TABLE pg_class;
2019-05-01 05:24:48.904 CEST [97690:431] pg_regress/create_view LOG: disconnection: session time: 0:00:03.748 user=pgbuildfarm database=regression host=[local]

which is mighty confusing at first glance, but I think the explanation is
that what the postmaster is reporting is process 97690's *latest* query,
not what it's currently doing. What it's really currently doing at the
moment of the deadlock is cleaning out its temporary schema after the
client disconnected. So this says you were careless about where to insert
the reindex_catalog test in the test schedule: it can't be after anything
that creates any temp objects. That seems like kind of a problem :-(.
We could put it second, after the tablespace test, but that would mean
that we're reindexing after very little churn has happened in the
catalogs, which doesn't seem like much of a stress test.

Another fairly interesting thing is that this log includes the telltale

2019-05-01 05:24:48.887 CEST [97694:7] pg_regress/reindex_catalog CONTEXT: while checking uniqueness of tuple (12,71) in relation "pg_class"

Why did I have to dig to find that information in HEAD? Have we lost
some useful context reporting? (Note this run is in the v10 branch.)

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2019-05-01 05:19:14 Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Previous Message Amit Kapila 2019-05-01 04:38:14 Re: POC: Cleaning up orphaned files using undo logs