Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, jianingy(dot)yang(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table
Date: 2019-03-08 10:22:46
Message-ID: e39595de-f5b7-aa84-4528-73a5307f7997@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On 2019/03/07 20:36, Amit Langote wrote:
> On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> The problem start when ALTER TABLE users ALTER COLUMN is executed.
>> create table users(user_id int, name varchar(64), unique (user_id, name))
>> partition by list(user_id);
>>
>> create table users_000 partition of users for values in(0);
>> create table users_001 partition of users for values in(1);
>> select relname, relfilenode from pg_class where relname like 'users%';
>> relname │ relfilenode
>> ────────────────────────────┼─────────────
>> users │ 16441
>> users_000 │ 16446
>> users_000_user_id_name_key │ 16449
>> users_001 │ 16451
>> users_001_user_id_name_key │ 16454
>> users_user_id_name_key │ 16444
>> (6 rows)
>>
>> alter table users alter column name type varchar(127);
>> select relname, relfilenode from pg_class where relname like 'users%';
>> relname │ relfilenode
>> ────────────────────────────┼─────────────
>> users │ 16441
>> users_000 │ 16446
>> users_000_user_id_name_key │ 16444 <=== duplicated
>> users_001 │ 16451
>> users_001_user_id_name_key │ 16444 <=== duplicated
>> users_user_id_name_key │ 16444 <=== duplicated
>> (6 rows)
>
> I checked why users_000's and user_0001's indexes end up reusing
> users_user_id_name_key's relfilenode. At the surface, it's because
> DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
> <old-parents-index's-relfilenode> in IndexStmt, which is recursively
> passed down to DefineIndex(<child-indexes-to-be-recreated>). This
> DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
> parent rel. This surface problem may be solved in DefineIndex() by
> just resetting oldNode in each child IndexStmt before recursing, but
> that means child indexes are recreated with new relfilenodes. That
> solves the immediate problem of relfilenodes being wrongly duplicated,
> that's leading to madness such as SMgrRelationHash corruption being
> seen in the original bug report.

This doesn't happen in HEAD, because in HEAD we got 807ae415c5, which
changed things so that partitioned relations always have their relfilenode
set to 0. So, there's no question of parent's relfilenode being passed to
children and hence being duplicated.

But that also means child indexes are unnecessarily rewritten, that is,
with new relfilenodes.

> But, the root problem seems to be that ATPostAlterTypeCleanup() on
> child tables isn't setting up their own
> DefineIndex(<child-index-to-be-rewritten>) step. That's because the
> parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
> constraint due to dependencies (+ CCI). So, ATExecAlterColumnType()
> on child relations isn't able to find the constraint on the individual
> child relations to turn into their own
> DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
> each relation's ATPostAlterTypeCleanup() independently, child's
> recreated indexes will be able to reuse their old relfilenodes and
> everything will be fine. But maybe that will require significant
> overhaul of how this post-alter-type-cleanup occurs?

We could try to solve this dividing ATPostAlterTypeCleanup processing into
two functions:

1 The first one runs right after ATExecAlterColumnType() is finished for a
given table (like it does today), which then runs ATPostAlterTypeParse to
generate commands for constraints and/or indexes to re-add. This part
won't drop the old constraints/indexes just yet, so child
constraints/indexes will remain for ATExecAlterColumnType to see when
executed for the children.

2. Dropping the old constraints/indexes is the responsibility of the 2nd
part, which runs just before executing ATExecReAddIndex or
ATExecAddConstraint (with is_readd = true), so that the new constraints
don't collide with the existing ones.

This arrangement allows step 1 to generate the commands to recreate even
the child indexes such that the old relfilenode can be be preserved by
setting IndexStmt.oldNode.

Attached patch is a very rough sketch, which fails some regression tests,
but I ran out of time today.

Thanks,
Amit

Attachment Content-Type Size
wip-post-alter-type-cleanup-divide.patch text/plain 5.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Peter Eisentraut 2019-03-08 11:00:40 Re: BUG #15631: Generated as identity field in a temporary table with on commit drop corrupts system catalogs
Previous Message PG Bug reporting form 2019-03-08 07:29:10 BUG #15677: Crash while deleting from partitioned table

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2019-03-08 10:29:30 Re: Update does not move row across foreign partitions in v11
Previous Message Shaoqi Bai 2019-03-08 10:14:41 Add tablespace tap test to pg_rewind