From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Date: 2018-10-25 20:26:22
Message-ID: 20181025202622.d3x4y4ch7m4pxwnn@alvherre.pgsql
Lists: pgsql-hackers


Here's my take on this feature, building on David Rowley's version.

First, I took Robert's advice and removed the CONCURRENTLY keyword
from the syntax; we now always do it that way. When there's a default
partition, only that partition is locked with AccessExclusiveLock (AEL);
everything else is locked with ShareUpdateExclusiveLock only.

I added some isolation tests for it -- they all pass for me.

There are two main ideas supporting this patch:

1. The partition descriptor cache module (partcache.c) now contains a
long-lived hash table that lists all the current partition descriptors;
when an invalidation message is received for a relation, we unlink the
partdesc from the hash table *but do not free it*. A new partdesc is
built and linked into the hash table the next time one is requested, so
several copies may exist in memory for one partitioned table.

2. Snapshots have their own cache (hash table) of partition descriptors.
If a partdesc is requested and the snapshot has already obtained that
partdesc, the original one is returned -- we don't request a new one
from partcache.
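
To make the interaction between ideas 1 and 2 concrete, here is a toy
model in Python rather than the C of the patch. Every name in it
(PartDesc, PartCache, Snapshot, the OIDs) is hypothetical and nothing is
taken from the patch itself; it only illustrates "unlink but do not free"
plus the per-snapshot cache.

```python
class PartDesc:
    """Immutable stand-in for a partition descriptor (hypothetical)."""

    def __init__(self, parent_oid, partition_oids):
        self.parent_oid = parent_oid
        self.partition_oids = tuple(partition_oids)


class PartCache:
    """Toy model of the long-lived hash table from idea 1."""

    def __init__(self, catalog):
        self.catalog = catalog  # parent OID -> list of partition OIDs
        self.table = {}         # parent OID -> current PartDesc

    def invalidate(self, parent_oid):
        # Unlink only: descriptors already handed out remain usable.
        self.table.pop(parent_oid, None)

    def lookup(self, parent_oid):
        desc = self.table.get(parent_oid)
        if desc is None:
            # Rebuild from the (toy) catalog and link the new copy.
            desc = PartDesc(parent_oid, self.catalog[parent_oid])
            self.table[parent_oid] = desc
        return desc


class Snapshot:
    """Toy model of the per-snapshot partdesc cache from idea 2."""

    def __init__(self, partcache):
        self.partcache = partcache
        self.partdescs = {}     # parent OID -> PartDesc seen by this snapshot

    def get_partdesc(self, parent_oid):
        # Ask partcache only the first time; afterwards return the original.
        if parent_oid not in self.partdescs:
            self.partdescs[parent_oid] = self.partcache.lookup(parent_oid)
        return self.partdescs[parent_oid]


catalog = {16384: [16390, 16391]}
cache = PartCache(catalog)

old_snap = Snapshot(cache)
before = old_snap.get_partdesc(16384)

catalog[16384] = [16390, 16391, 16392]  # a partition gets attached
cache.invalidate(16384)                 # the inval message arrives

new_snap = Snapshot(cache)
after = new_snap.get_partdesc(16384)
```

In this model the old snapshot keeps returning `before` with two
partitions, while the new snapshot gets a freshly built descriptor with
three. Both copies are alive at the same time, which is the "many copies"
situation described in item 1.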

Then there are a few other implementation details worth mentioning:

3. parallel query: when a worker starts on a snapshot that has a
partition descriptor cache, we need to transmit those partdescs from
the leader via shmem ... but we cannot send the full struct, so we just
send the OID list of partitions, then rebuild the descriptor in the
worker. Side effect: if a partition is detached right between the leader
taking the partdesc and the worker starting, the partition's
pg_class.relpartbound is cleared, so it's not possible to reconstruct
the partdesc. In this case, we raise an error. Hopefully this is rare.

4. If a partitioned table is dropped, but was listed in a snapshot's
partdesc cache, and then parallel query starts, the worker will try to
restore the partdesc for that table, but there are no catalog rows for
it. The implementation choice here is to ignore the table and move on.
I would like to just remove the partdesc from the snapshot, but that
would require a relcache inval callback, and a) it'd kill us to scan all
snapshots for every relation drop; b) it doesn't work anyway because we
don't have any way to distinguish invals arriving because of DROP from
invals arriving because of anything else, say ANALYZE.

5. snapshots are copied a lot. Copies share the same hash table as the
"original", because surely all copies should see the same partition
descriptor. This leads to the pinning/unpinning business you see for
the structs in snapmgr.c.
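
The worker-side behaviour in items 3 and 4 can be sketched with another
toy model in Python. The function names, the exception class, and the
dict standing in for pg_class are all hypothetical; in particular,
treating a partition whose row is missing altogether the same way as one
with a cleared relpartbound is an assumption of this sketch, not
something the patch is known to do.

```python
class PartitionBoundMissing(Exception):
    """Raised when a partition's bound can no longer be read (item 3)."""


def serialize_partdescs(snapshot_partdescs):
    # Only OIDs cross to the worker: parent OID -> tuple of partition OIDs.
    return {parent: tuple(oids) for parent, oids in snapshot_partdescs.items()}


def restore_partdescs(serialized, pg_class):
    """Rebuild descriptors in the worker from a dict standing in for pg_class."""
    restored = {}
    for parent_oid, part_oids in serialized.items():
        if parent_oid not in pg_class:
            # Item 4: the partitioned table was dropped; ignore it and move on.
            continue
        bounds = []
        for oid in part_oids:
            # Assumption: a missing row is handled like a cleared bound.
            bound = pg_class.get(oid, {}).get("relpartbound")
            if bound is None:
                # Item 3: detached in the meantime, so the bound is gone.
                raise PartitionBoundMissing(f"no partition bound for relation {oid}")
            bounds.append((oid, bound))
        restored[parent_oid] = bounds
    return restored


pg_class = {
    100: {"relpartbound": None},                 # the partitioned table
    101: {"relpartbound": "FOR VALUES IN (1)"},
    102: {"relpartbound": "FOR VALUES IN (2)"},
}
# Relation 200 was in the leader's snapshot cache but has since been dropped.
wire = serialize_partdescs({100: [101, 102], 200: [201]})
restored = restore_partdescs(wire, pg_class)
```

Here the dropped table 200 is silently skipped, while clearing the bound
of 101 or 102 before calling `restore_partdescs` again would raise
`PartitionBoundMissing`, mirroring the two outcomes described above.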

Some known defects:

6. this still leaks memory. Not as terribly as my earlier prototypes,
but clearly it's something that I need to address.

7. I've considered the idea of tracking snapshot-partdescs in resowner.c
to prevent future memory leak mistakes. Not done yet. Closely related
to item 6.

8. Header changes may still need some cleanup -- e.g. I'm not sure
snapmgr.h compiles standalone.

9. David Rowley recently pointed out that we can modify
CREATE TABLE .. PARTITION OF to likewise not obtain AEL anymore.
Apparently it just requires removal of three lines in MergeAttributes.

Álvaro Herrera
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
attach-concurrently.patch text/x-diff 65.7 KB
