Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Date: 2021-04-27 14:06:09
Message-ID: CA+HiwqGTjrgbA7EbgUvzbc2VuS5ijmWHMw+pQE4U1riMpT9Yww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 27, 2021 at 4:34 PM Amit Langote <amitlangote09(at)gmail(dot)com> wrote:
> Thanks for the updated patch. I've been reading it, but I noticed a
> bug in 8aba9322511f, which I thought you'd want to know to make a note
> of when committing this one.
>
> So we decided in 8aba9322511f that it is okay to make the memory
> context in which a transient partdesc is allocated a child of
> PortalContext so that it disappears when the portal does. But
> PortalContext is apparently NULL when the planner runs, at least in
> the "simple" query protocol, so any transient partdescs built by the
> planner would effectively leak. Not good.
>
> With this patch, partdesc_nodetached is no longer transient, so the
> problem doesn't exist.
>
> I will write more about the updated patch in a bit...

The first thing that struck me about partdesc_nodetached is that it's
not handled in RelationClearRelation(), where we (re)set a regular
partdesc to NULL so that the next RelationGetPartitionDesc() has to
build it from scratch. I think partdesc_nodetached and the xmin
should likewise be reset. Also in load_relcache_init_file(), although
nothing serious there.

That said, I think I may have found a couple of practical problems
with partdesc_nodetached, or more precisely with having it
side-by-side with regular partdesc. Maybe they can be fixed, so the
problems are not as such deal breakers for the patch's main idea. The
problems can be seen when different queries in a serializable
transaction have to use both the regular partdesc and
partdesc_detached in a given relcache. For example, try the following
after first creating a situation where the table p has a
detach-pending partition p2 (for values in (2) and a live partition p1
(for values in (1)).

begin isolation level serializable;
insert into p values (1);
select * from p where a = 1;
insert into p values (1);

The 1st insert succeeds but the 2nd fails with:

ERROR: no partition of relation "p" found for row
DETAIL: Partition key of the failing row contains (a) = (1).

I haven't analyzed this error very closely but there is another
situation which causes a crash due to what appears to be a conflict
with rd_pdcxt's design:

-- run in a new session
begin isolation level serializable;
select * from p where a = 1;
insert into p values (1);
select * from p where a = 1;

The 2nd select crashes:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

The crash occurs because the planner gets handed a stale copy of
partdesc_nodetached for the 2nd select. It gets stale, because the
context it's allocated in gets made a child of rd_pdcxt, which is in
turn assigned the context of the regular partdesc when it is built for
the insert query. Any child contexts of rd_pdcxt are deleted as soon
as the Relation's refcount goes to zero, taking it with
partdesc_nodetached. Note it is this code in
RelationBuildPartitionDesc():

/*
* But first, a kluge: if there's an old rd_pdcxt, it contains an old
* partition descriptor that may still be referenced somewhere.
* Preserve it, while not leaking it, by reattaching it as a child
* context of the new rd_pdcxt. Eventually it will get dropped by
* either RelationClose or RelationClearRelation.
*/
if (rel->rd_pdcxt != NULL)
MemoryContextSetParent(rel->rd_pdcxt, new_pdcxt);
rel->rd_pdcxt = new_pdcxt;

I think we may need a separate context for partdesc_nodetached, likely
with the same kludges as rd_pdcxt. Maybe the first problem will go
away with that as well.

Few other minor things I noticed:

+ * Store it into relcache. For snapshots built excluding detached
+ * partitions, which we save separately, we also record the
+ * pg_inherits.xmin of the detached partition that was omitted; this
+ * informs a future potential user of such a cached snapshot.

The "snapshot" in the 1st and the last sentence should be "partdesc"?

+ * We keep two partdescs in relcache: rd_partdesc_nodetached excludes
+ * partitions marked concurrently being detached, while rd_partdesc includes
+ * them.

IMHO, describing rd_partdesc first in the sentence would be better.
Like: rd_partdesc includes all partitions including any that are being
concurrently detached, while rd_partdesc_nodetached excludes them.

--
Amit Langote
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2021-04-27 14:07:27 Should we document the behaviour of TRUNCATE skipping repeated relations?
Previous Message Justin Pryzby 2021-04-27 14:03:47 Re: Performance degradation of REFRESH MATERIALIZED VIEW