Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

From: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Langote <amitlangote09(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, jianingy(dot)yang(at)gmail(dot)com, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table
Date: 2019-03-27 02:40:12
Message-ID: d51e28a8-a725-b7ef-a6cf-8186e775b852@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

On 2019/03/08 19:22, Amit Langote wrote:
> On 2019/03/07 20:36, Amit Langote wrote:
>> On Thu, Mar 7, 2019 at 11:17 AM Amit Langote
>> <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> The problem start when ALTER TABLE users ALTER COLUMN is executed.
>>> create table users(user_id int, name varchar(64), unique (user_id, name))
>>> partition by list(user_id);
>>>
>>> create table users_000 partition of users for values in(0);
>>> create table users_001 partition of users for values in(1);
>>> select relname, relfilenode from pg_class where relname like 'users%';
>>> relname │ relfilenode
>>> ────────────────────────────┼─────────────
>>> users │ 16441
>>> users_000 │ 16446
>>> users_000_user_id_name_key │ 16449
>>> users_001 │ 16451
>>> users_001_user_id_name_key │ 16454
>>> users_user_id_name_key │ 16444
>>> (6 rows)
>>>
>>> alter table users alter column name type varchar(127);
>>> select relname, relfilenode from pg_class where relname like 'users%';
>>> relname │ relfilenode
>>> ────────────────────────────┼─────────────
>>> users │ 16441
>>> users_000 │ 16446
>>> users_000_user_id_name_key │ 16444 <=== duplicated
>>> users_001 │ 16451
>>> users_001_user_id_name_key │ 16444 <=== duplicated
>>> users_user_id_name_key │ 16444 <=== duplicated
>>> (6 rows)
>>
>> I checked why users_000's and user_0001's indexes end up reusing
>> users_user_id_name_key's relfilenode. At the surface, it's because
>> DefineIndex(<parent's-index-to-be-recreated>) is carrying oldNode =
>> <old-parents-index's-relfilenode> in IndexStmt, which is recursively
>> passed down to DefineIndex(<child-indexes-to-be-recreated>). This
>> DefineIndex() chain is running due to ATPostAlterTypeCleanup() on the
>> parent rel. This surface problem may be solved in DefineIndex() by
>> just resetting oldNode in each child IndexStmt before recursing, but
>> that means child indexes are recreated with new relfilenodes. That
>> solves the immediate problem of relfilenodes being wrongly duplicated,
>> that's leading to madness such as SMgrRelationHash corruption being
>> seen in the original bug report.
>
> This doesn't happen in HEAD, because in HEAD we got 807ae415c5, which
> changed things so that partitioned relations always have their relfilenode
> set to 0. So, there's no question of parent's relfilenode being passed to
> children and hence being duplicated.
>
> But that also means child indexes are unnecessarily rewritten, that is,
> with new relfilenodes.
>
>> But, the root problem seems to be that ATPostAlterTypeCleanup() on
>> child tables isn't setting up their own
>> DefineIndex(<child-index-to-be-rewritten>) step. That's because the
>> parent's ATPostAlterTypeCleanup() dropped child copies of the UNIQUE
>> constraint due to dependencies (+ CCI). So, ATExecAlterColumnType()
>> on child relations isn't able to find the constraint on the individual
>> child relations to turn into their own
>> DefineIndex(<child-index-to-be-rewritten>). If we manage to handle
>> each relation's ATPostAlterTypeCleanup() independently, child's
>> recreated indexes will be able to reuse their old relfilenodes and
>> everything will be fine. But maybe that will require significant
>> overhaul of how this post-alter-type-cleanup occurs?
>
> We could try to solve this dividing ATPostAlterTypeCleanup processing into
> two functions:
>
> 1 The first one runs right after ATExecAlterColumnType() is finished for a
> given table (like it does today), which then runs ATPostAlterTypeParse to
> generate commands for constraints and/or indexes to re-add. This part
> won't drop the old constraints/indexes just yet, so child
> constraints/indexes will remain for ATExecAlterColumnType to see when
> executed for the children.
>
> 2. Dropping the old constraints/indexes is the responsibility of the 2nd
> part, which runs just before executing ATExecReAddIndex or
> ATExecAddConstraint (with is_readd = true), so that the new constraints
> don't collide with the existing ones.
>
> This arrangement allows step 1 to generate the commands to recreate even
> the child indexes such that the old relfilenode can be be preserved by
> setting IndexStmt.oldNode.
>
> Attached patch is a very rough sketch, which fails some regression tests,
> but I ran out of time today.

I'm thinking of adding this to open items under Older Bugs. Attached the
patch that I had posted on -bugs, but it's only a rough sketch as
described above, not a full fix.

Link to the original bug report:
https://www.postgresql.org/message-id/flat/15672-b9fa7db32698269f%40postgresql.org

Thanks,
Amit

Attachment Content-Type Size
wip-post-alter-type-cleanup-divide.patch text/plain 5.5 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Michael Paquier 2019-03-27 02:56:20 Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table
Previous Message Tom Lane 2019-03-26 22:53:56 Re: BUG #15715: UPDATE using "in (subquery for update limit 1)" does not respect the limit in subquery

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-03-27 02:56:20 Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table
Previous Message Michael Paquier 2019-03-27 02:36:50 Re: Shouldn't current_schema() be at least PARALLEL RESTRICTED?