Re: Table AM and DDLs

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Table AM and DDLs
Date: 2021-03-03 21:15:18
Message-ID: CA+14427mvgNJzekhyeVU-1VR7pbZPk1WDM=4BhyBrrqTOeoBVA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 23, 2021 at 2:11 AM Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>

Hi Andres,

Thanks for the answer and sorry about the late reply.

> On 2021-02-22 08:33:21 +0100, Mats Kindahl wrote:
> > I started to experiment with the table access method interface to see if
> it
> > can be used for some ideas I have.
>
> Cool.
>
>
> > The `relation_set_new_filenode` is indirectly called from
> > `heap_create_with_catalog`, but there is no corresponding callback from
> > `heap_drop_with_catalog`. It also seems like the intention is that the
> > callback should call `RelationCreateStorage` itself (makes sense, since
> the
> > access method knows about how to use the storage), so it seems natural to
> > add a `relation_reset_filenode` to the table AM that is called from
> > `heap_drop_with_catalog` for tables and add that to the heap
> implementation
> > (see the attached patch).
>
> I don't think that's quite right. It's not exactly obvious from the
> name, but RelationDropStorage() does not actually drop storage. Instead
> it *schedules* the storage to be dropped upon commit.
>
> The reason for deferring the dropping of table storage is that DDL in
> postgres is transactional. Therefore we cannot remove the storage at the
> moment the DROP TABLE is executed - only when the transaction that
> performed the DDL commits. Therefore just providing you with a callback
> that runs in heap_drop_with_catalog() doesn't really achieve much -
> you'd not have a way to execute the "actual" dropping of the relation at
> the later stage.
>

Yeah, I found the chain (performDeletion -> deleteOneObject -> doDeletion
-> heap_drop_with_catalog) where the storage is only scheduled for
deletion, but it looked as if this was the place where the "actual" delete
is performed. Looking closer, I see that this was the wrong location.
However, the intention was to get a callback when the "actual" delete
happens: before that point the blocks are potentially still alive and
could be read, so they shouldn't be recycled.
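
For my own understanding, the deferral you describe looks roughly like
this in catalog/storage.c (abridged and written from memory, so details
may be off): RelationDropStorage() only queues an entry

    typedef struct PendingRelDelete
    {
        RelFileNode relnode;    /* relation that may need to be deleted */
        BackendId   backend;    /* InvalidBackendId if not a temp rel */
        bool        atCommit;   /* T=delete at commit; F=delete at abort */
        int         nestLevel;  /* xact nesting level of request */
        struct PendingRelDelete *next;  /* linked-list link */
    } PendingRelDelete;

and smgrDoPendingDeletes() walks that list at transaction end and performs
the unlink, which is the point where a table AM would actually want to be
told to release its storage.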

The right location seems to be the storage manager (smgr_unlink in
smgr.c), but that does not seem to be extensible. Are there any plans to
make it extensible, so that you can implement something other than just
"magnetic disk"?

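As far as I can tell, the reason is that the dispatch in smgr.c goes
through a static array, roughly like this (abridged and from memory, so
the exact fields may differ):

    typedef struct f_smgr
    {
        void    (*smgr_init) (void);    /* may be NULL */
        void    (*smgr_open) (SMgrRelation reln);
        void    (*smgr_create) (SMgrRelation reln, ForkNumber forknum,
                                bool isRedo);
        void    (*smgr_unlink) (RelFileNodeBackend rnode, ForkNumber forknum,
                                bool isRedo);
        /* ... read/write/extend/truncate callbacks elided ... */
    } f_smgr;

    static const f_smgr smgrsw[] = {
        /* magnetic disk */
        {
            .smgr_init = mdinit,
            .smgr_open = mdopen,
            .smgr_create = mdcreate,
            .smgr_unlink = mdunlink,
            /* ... */
        }
    };

Since smgrsw[] is a compile-time constant, there is no way for an
extension to register its own entry, so a table AM can take over scans and
inserts but cannot hook the point where the underlying files are actually
unlinked.
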
> > Creating new blocks for a table is straightforward to implement by using
> > the `relation_set_new_filenode` callback where you can create new memory
> > blocks for a relation, but I cannot find a way to clean up those blocks
> > when the table is dropped nor a way to handle a change of the schema for
> a
> > table.
>
> What precisely do you mean with the "handle a change of the schema" bit?
> I.e. what would you like to do, and what do you think is preventing you
> from it? But before you answer see my next point below.
>
>
> > Altering the schema does not seem to be covered at all, but this is
> > something that table access methods need to know about since it might
> want
> > to optimize the internal storage when the schema changes. I have not been
> > able to find any discussions around this, but it seems like a natural
> thing
> > to do with a table. Have I misunderstood how this works?
>
> Due to postgres' transactional DDL you cannot really change the storage
> layout of *existing data* when that DDL command is executed - the data
> still needs to be interpretable in case the DDL is rolled back
> (including when crashing).
>

No, I didn't expect that; what I was hoping for is some means to see that
a schema change is about to happen.

> Before I explain some more: Could you describe in a bit more detail what
> kind of optimization you'd like to make?
>

This is not really about any particular optimization; it is more about a
good API for tables and for managing storage. If a memory table can be
implemented entirely in an extension, with its storage fully managed by
the extension, there is a lot of interesting potential for various
implementations of table backends. For this to work I think it is
necessary to be able to handle schema changes for the backend storage in
addition to scans, inserts, updates, and deletes, but I am not sure
whether that is already possible in some way I haven't discovered, or
whether I should just try to propose something (making the storage manager
API extensible seems like a good first attempt).
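
To make the idea concrete, here is a minimal sketch of how an in-memory AM
might implement relation_set_new_filenode (the signature is the PG 13 one,
written from memory; mem_table_create() is a hypothetical helper inside
the extension, not an existing function):

    #include "postgres.h"

    #include "access/multixact.h"
    #include "access/transam.h"
    #include "storage/relfilenode.h"
    #include "utils/rel.h"

    /* Hypothetical extension-internal helper that sets up the in-memory
     * storage for a given relfilenode. */
    extern void mem_table_create(const RelFileNode *rnode);

    static void
    memam_relation_set_new_filenode(Relation rel,
                                    const RelFileNode *newrnode,
                                    char persistence,
                                    TransactionId *freezeXid,
                                    MultiXactId *minmulti)
    {
        /* An in-memory AM has no frozen-xid bookkeeping to report. */
        *freezeXid = InvalidTransactionId;
        *minmulti = InvalidMultiXactId;

        /* Instead of RelationCreateStorage(), allocate our own blocks. */
        mem_table_create(newrnode);
    }

The missing piece is the other direction: nothing comparable tells the AM
when those blocks can finally be released, or that a rewrite is about to
reorganize the data.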

> Back to schema change handling:
>
> For some schema changes postgres assumes that they can be done
> "in-place", e.g. adding a column to a table.
>
> Other changes, e.g. changing the type of a column "sufficiently", will
> cause a so called table rewrite. Which means that a new relation will be
> created (including a call to relation_set_new_filenode()), then that new
> relation will get all the new data inserted, and then
> pg_class->relfilenode for the "original" relation will be changed to the
> "rewritten" table (there's two variants of this, once for rewrites due
> to ALTER TABLE and a separate one for VACUUM FULL/CLUSTER).
>

But that is not visible in the access method interface. If I add debug
output to the memory table, I only see a call to needs_toast_table. If
there were a call to create the new blocks, together with some additional
information about the change, this would be possible to handle.

I *was* expecting either a call to set_filenode with a new xact id, or
something like that, together with some information that lets you locate
the planned schema change (e.g., by digging through pg_class and friends),
but I just don't see that when I add debug output.
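
What I had vaguely in mind is something along these lines (entirely
hypothetical: no such callback exists in TableAmRoutine today, and the
name, signature, and the mem_table_prepare_rewrite() helper are made up
purely to illustrate the kind of notification I am after):

    #include "postgres.h"

    #include "utils/rel.h"

    /* Hypothetical extension-internal helper. */
    extern void mem_table_prepare_rewrite(TupleDesc old_desc,
                                          TupleDesc new_desc);

    /*
     * Hypothetical callback, invoked before a table rewrite starts
     * copying rows into the new relation, so the AM can reorganize its
     * storage for the new tuple descriptor.
     */
    static void
    memam_relation_schema_change(Relation old_rel, Relation new_rel)
    {
        mem_table_prepare_rewrite(RelationGetDescr(old_rel),
                                  RelationGetDescr(new_rel));
    }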

> When the transaction containing such a rewrite commits that
> ->relfilenode change becomes visible for everyone, and the old
> relfilenode will be deleted.
>
>
> This means that right now there's no easy way to store the data anywhere
> but in the file referenced by pg_class.relfilenode. I don't think
> anybody would object on principle to making the necessary infrastructure
> changes to support storing data elsewhere - but I think it'll also not
> quite as simple as the change you suggested :(.
>

Seems it is not. It is fine to store table information in pg_class, but
to implement "interesting" backends there needs to be a way to handle
schema changes (among other things), and I do not see how to do that right
now, unless I have misunderstood how this is expected to work.

I have a lot more questions about the table access API, but this is the
first thing.

Best wishes,
Mats Kindahl

> Greetings,
>
> Andres Freund
>
