Table AM and DROP TABLE [ Was: Table AM and DDLs]

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Table AM and DROP TABLE [ Was: Table AM and DDLs]
Date: 2021-04-05 19:57:12
Message-ID: CA+144243EHgLYcmBSd+WWDDOfZRzAux8F=L6X7TxFZRADR9mUQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 22, 2021 at 12:16 AM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2021-03-03 22:15:18 +0100, Mats Kindahl wrote:
> > On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres(at)anarazel(dot)de>
> wrote:
> > Thanks for the answer and sorry about the late reply.
>
> Mine is even later ;)
>

:)

Seems I keep the tradition. :)

Thanks a lot for the pointers, I have some comments below both about DROP
TABLE and ALTER TABLE.

> > > I don't think that's quite right. It's not exactly obvious from the
> > > name, but RelationDropStorage() does not actually drop storage. Instead
> > > it *schedules* the storage to be dropped upon commit.
> > >
> > > The reason for deferring the dropping of table storage is that DDL in
> > > postgres is transactional. Therefore we cannot remove the storage at
> the
> > > moment the DROP TABLE is executed - only when the transaction that
> > > performed the DDL commits. Therefore just providing you with a callback
> > > that runs in heap_drop_with_catalog() doesn't really achieve much -
> > > you'd not have a way to execute the "actual" dropping of the relation
> at
> > > the later stage.
> > >
> >
> > Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
> > -> heap_drop_with_catalog) where the delete was just scheduled for
> deletion
> > but it appeared like this was the place to actually perform the "actual"
> > delete. Looking closer, I see this was the wrong location. However, the
> > intention was to get a callback when the "actual" delete should happen.
> > Before that, the blocks are still potentially alive and could be read, so
> > shouldn't be recycled.
> >
> > It seems the right location seems to be in the storage manager
> (smgr_unlink
> > in smgr.c), but that does not seem to be extensible, or are there any
> plans
> > to make it available so that you can implement something other than just
> > "magnetic disk"?
>
> There've been patches to add new types of storage below smgr.c, but not
> in way that can be done outside of build time. As far as I recall.
>

I have done some more research and I do not think it is necessary to extend
the storage layer. As a matter of fact, I think the patch I suggested is
the right approach: let me elaborate on why.

Let's look at how the implementation works with the heap access method (the
file heapam_handler.c), using CREATE TABLE, DROP TABLE, and TRUNCATE TABLE
as examples (the last because it is supported in the Table AM API and hence
is a good reference for comparison).

Disregarding surrounding layers, we have three layers that are important
here:

1. Heap catalog layer (not sure what to call it, but it's the
src/backend/catalog/heap.c file)
2. AM layer (the src/backend/access/heap/heapam_handler.c file)
3. Storage layer (the src/backend/catalog/storage.c file) "code to
create and destroy physical storage for relations".

Looking at TRUNCATE TABLE, we have the following calls through these
layers:

1. In the heap catalog layer we have a call of heap_truncate_one_rel
which calls the table AM layer.
2. In the Table AM layer heapam_relation_nontransactional_truncate will
just call the storage layer to truncate the storage.
3. The storage layer gets called through RelationTruncate, which will
truncate the actual files.

Looking at CREATE TABLE, we have a similar pattern:

1. In the heap catalog layer heap_create_with_catalog is called, which
in turn calls heap_create, which will create the actual relcache and also
call the table AM layer if it is a relation, materialized view, or TOAST
table.
2. In the Table AM layer, heapam_relation_set_new_filenode is called
which will record the transaction identifiers and call the storage layer to
create the underlying storage.
3. In the storage layer, RelationCreateStorage will create the necessary
storage, but also register the table for deletion if the transaction is
aborted.

Note here that the storage layer remembers the table for deletion by saving
it in pendingDeletes, which is local to the storage layer.

Looking at DROP TABLE, we have a similar pattern, but one step is missing:

   1. In the heap catalog layer, the function heap_drop_with_catalog is
   called, which releases the system cache entry and calls the storage layer
   to drop the relation.
   2. In the storage layer, the function RelationDropStorage is called,
   which records the table to be dropped in pendingDeletes.

When committing (or aborting) the transaction, there are two calls that are
interesting, in this order:

1. CallXactCallbacks which calls registered callbacks
2. smgrDoPendingDeletes, which calls the storage layer directly to
perform the actual deletion, if necessary.

Now, suppose that we want to replace the storage layer with a different
one. It is straightforward to replace it by implementing the Table AM
methods above, but we are missing a callback on dropping the table. If we
have that, we can record the table-to-be-dropped in a similar manner to how
the heap AM does it and register a transaction callback using
RegisterXactCallback.

>
> > > Before I explain some more: Could you describe in a bit more detail
> what
> > > kind of optimization you'd like to make?
> > >
> >
> > This is not really about any optimizations, it more about a good API for
> > tables and managing storage. If a memory table can be implemented
> entirely
> > in the extension and storage managed fully, there is a lot of interesting
> > potential for various implementations of table backends. For this to
> work I
> > think it is necessary to be able to handle schema changes for the backend
> > storage in addition to scans, inserts, updates, and deletes, but I am not
> > sure if it is already possible in some way that I haven't discovered or
> if
> > I should just try to propose something (making the storage manager API
> > extensible seems like a good first attempt).
>
> As long as you have a compatible definition of what is acceptable "in
> place" ALTER TABLE (e.g. adding new columns, changing between compatible
> types), and what requires a table rewrite (e.g. an incompatible column
> type change), I don't see a real problem. Except for the unlink thing
> above.
>
> Any schema change requiring a table rewrite will trigger a new relation
> to be created, which in turn will involve tableam. After that you'll
> just get called back to re-insert all the tuples in the original
> relation.
>
> If you want a different definition on what needs a rewrite, good luck,
> it'll be a heck of a lot more work.
>

No, this should work fine.

> > Due to postgres' transactional DDL you cannot really change the storage
> > > layout of *existing data* when that DDL command is executed - the data
> > > still needs to be interpretable in case the DDL is rolled back
> > > (including when crashing).
> > >
> >
> > No, didn't expect this, but some means to see that a schema change is
> about
> > to happen.
>
> For anything that's not in-place you'll see a new table being created
> (c.f. ATRewriteTables() calling make_new_heap()). The relfilenode
> identifying the data (as opposed to the oid, identifying a relation),
> will then be swapped with the current table's relfilenode via
> finish_heap_swap().
>
>
> > > Other changes, e.g. changing the type of a column "sufficiently", will
> > > cause a so called table rewrite. Which means that a new relation will
> be
> > > created (including a call to relation_set_new_filenode()), then that
> new
> > > relation will get all the new data inserted, and then
> > > pg_class->relfilenode for the "original" relation will be changed to
> the
> > > "rewritten" table (there's two variants of this, once for rewrites due
> > > to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).
>
> > But that is not visible in the access method interface. If I add debug
> > output to the memory table, I only see a call to needs_toast_table. If
> > there were a new call to create a new block and some additional
> information
> > about , this would be possible to handle.
>
> It should be. If I e.g. do
>
> CREATE TABLE blarg(id int4 not null);
> I get one call to table_relation_set_new_filenode()
>
> #0 table_relation_set_new_filenode (rel=0x7f84c2417b70,
> newrnode=0x7f84c2417b70, persistence=112 'p', freezeXid=0x7ffc8c61263c,
> minmulti=0x7ffc8c612638)
> at /home/andres/src/postgresql/src/include/access/tableam.h:1596
> #1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612900 "blarg",
> relnamespace=2200, reltablespace=0, relid=3016410, relfilenode=3016410,
> accessmtd=2,
> tupDesc=0x55b191d2a8c8, relkind=114 'r', relpersistence=112 'p',
> shared_relation=false, mapped_relation=false,
> allow_system_table_mods=false,
> relfrozenxid=0x7ffc8c61263c, relminmxid=0x7ffc8c612638) at
> /home/andres/src/postgresql/src/backend/catalog/heap.c:436
> #2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612900
> "blarg", relnamespace=2200, reltablespace=0, relid=3016410, reltypeid=0,
> reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x55b191d2a8c8,
> cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p',
> shared_relation=false,
> mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0,
> use_user_acl=true, allow_system_table_mods=false, is_internal=false,
> relrewrite=0,
> typaddress=0x0) at
> /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
> #3 0x000055b19030002a in DefineRelation (stmt=0x55b191d31478, relkind=114
> 'r', ownerId=10, typaddress=0x0,
> queryString=0x55b191c35bc0 "CREATE TABLE blarg(id int8 not null
>
> then when I do
> ALTER TABLE blarg ALTER COLUMN id TYPE int8;
> I see
> #0 table_relation_set_new_filenode (rel=0x7f84c241f2a8,
> newrnode=0x7f84c241f2a8, persistence=112 'p', freezeXid=0x7ffc8c61275c,
> minmulti=0x7ffc8c612758)
> at /home/andres/src/postgresql/src/include/access/tableam.h:1596
> #1 0x000055b1901e9116 in heap_create (relname=0x7ffc8c612860
> "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407,
> relfilenode=3016407,
> accessmtd=2, tupDesc=0x7f84c24162a0, relkind=114 'r',
> relpersistence=112 'p', shared_relation=false, mapped_relation=false,
> allow_system_table_mods=true,
> relfrozenxid=0x7ffc8c61275c, relminmxid=0x7ffc8c612758) at
> /home/andres/src/postgresql/src/backend/catalog/heap.c:436
> #2 0x000055b1901eab28 in heap_create_with_catalog (relname=0x7ffc8c612860
> "pg_temp_3016404", relnamespace=2200, reltablespace=0, relid=3016407,
> reltypeid=0,
> reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x7f84c24162a0,
> cooked_constraints=0x0, relkind=114 'r', relpersistence=112 'p',
> shared_relation=false,
> mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0,
> use_user_acl=false, allow_system_table_mods=true, is_internal=true,
> relrewrite=3016404,
> typaddress=0x0) at
> /home/andres/src/postgresql/src/backend/catalog/heap.c:1291
>
>
>
> > I *was* expecting either a call of set_filenode with a new xact id, or
> > something like that, and with some information so that you can locate the
> > schema change planned (e.g., digging through pg_class and friends), I
> just
> > don't see that when I add debug output.
>
> You should. And it'll have the new table "schema" associated. E.g. in
> the above example the new table will have
> rel->rd_att->natts == 1
> rel->rd_att->attrs[0].atttypid == 20 (i.e. int8)
>

I didn't get a callback because I did ADD COLUMN and that works
differently: it does not call set_filenode until you either try to insert
something or run a vacuum. Thanks for the pointers, it helped a lot. I need
to look over the code a little more.

Thanks,
Mats Kindahl

>
> Greetings,
>
> Andres Freund
>
