Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]

From: Mats Kindahl <mats(at)timescale(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Aleksander Alekseev <aleksander(at)timescale(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Table AM and DROP TABLE [ Was: Table AM and DDLs]
Date: 2022-07-29 10:36:13
Message-ID: CA+14425KObZ6kfoePtX1XRjV7r+hy8MAJTO7vdTvXkz4E2R6gg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 16, 2022 at 10:07 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

> On 27/09/2021 14:59, Aleksander Alekseev wrote:
> > Hi hackers,
> >
> >> As a matter of fact, I think the patch I suggested is the right
> approach:
> >> let me elaborate on why.
> >> [...]
> >> It is straightforward to replace it by implementing the Table AM methods
> >> above, but we are missing a callback on dropping the table. If we have
> that,
> >> we can record the table-to-be-dropped in a similar manner to how the
> heap AM
> >> does it and register a transaction callback using RegisterXactCallback.
> >
> > Since no one objected in 5 months, I assume Mats made a good point. At
> least,
> > personally, I can't argue.
>
> I agree that having a table AM callback at relation drop would make it
> more consistent with creating and truncating a relation. Then again, the
> indexam API doesn't have a drop-callback either.
>

That is actually a good point. If we add a drop callback for the table AM, we
could add one for the index AM as well. I haven't looked at that though, so if
you think it should be added, I can investigate.
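
To make the proposal concrete, here is a minimal sketch of what the table AM
side could look like. It is hypothetical throughout: TableAmRoutine has no
relation_drop member today, and memam_release_segment() is an imagined
AM-local helper. Following the approach quoted above, the callback records the
relation to be dropped and a transaction callback performs the actual cleanup
at commit:

#include "postgres.h"

#include "access/xact.h"
#include "nodes/pg_list.h"
#include "storage/relfilenode.h"
#include "utils/memutils.h"
#include "utils/rel.h"

static void memam_release_segment(RelFileNode *rnode);  /* hypothetical helper */

/* Relations scheduled for drop in the current transaction. */
static List *pending_drops = NIL;

static void
memam_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_ABORT)
    {
        if (event == XACT_EVENT_COMMIT)
        {
            ListCell   *lc;

            /* The drop only takes effect if the transaction commits. */
            foreach(lc, pending_drops)
                memam_release_segment((RelFileNode *) lfirst(lc));
        }
        list_free_deep(pending_drops);
        pending_drops = NIL;
    }
}

/*
 * Hypothetical new TableAmRoutine member: called when DROP TABLE is
 * executed, not when the transaction commits.
 */
static void
memam_relation_drop(Relation rel)
{
    /* Keep the list in TopMemoryContext so it survives until commit. */
    MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);
    RelFileNode *rnode = palloc(sizeof(RelFileNode));

    *rnode = rel->rd_node;
    pending_drops = lappend(pending_drops, rnode);
    MemoryContextSwitchTo(oldcxt);
}

/* memam_xact_callback would be registered once, e.g. in _PG_init():
 *     RegisterXactCallback(memam_xact_callback, NULL);
 */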

> But what can you actually do in the callback? WAL replay of dropping the
> storage needs to work without running any AM-specific code. It happens
> as part of replaying a commit record. So whatever action you do in the
> callback will not be executed at WAL replay. Also, because the callback
> merely *schedules* things to happen at commit, it cannot generate
> separate WAL records about dropping resources either.
>

Digressing slightly: this is actually a drawback, and I have been looking for
a way to run code on recovery.

To give an example: for a distributed in-memory table, the table will be empty
after a restart, but it might be desirable to fetch its contents from another
node during recovery. This is currently not possible, since recovery assumes
that all actions are present in the WAL, and a memory table would not use the
WAL, given that one of its goals is to be fast and avoid writes to disk.

It might be possible to piggyback this fetching on the first SELECT or INSERT
on the table, but that makes the system more complicated, since it is easy to
miss a place where the fetching needs to happen. Doing it on recovery means
there is a single place to add the code, and the system is in a predictable
state afterwards.

In addition, doing the work on the first access of the table would add
execution time to that first operation, whereas most users would assume that
all such work is done during recovery and that the database is "warm"
afterwards.

However, this is slightly outside the scope of the proposed change, so we can
ignore it for now.

>
> Mats's in-memory table is an interesting example. I guess you don't even
> try WAL-logging that, so it's OK that nothing happens at WAL replay. As
> you said, you use the callback to schedule deletion of the shared memory
> block and an end-of-xact callback to perform the deletion. You're
> basically re-inventing a pending-deletes mechanism similar to smgr's.
>

> I think you could actually piggyback on smgr's pending-deletions
> mechanism instead of re-inventing it. In the callback, you can call
> smgrGetPendingDeletes(), and drop the shared memory segment for any
> relation in that list.
>

Hmm... it is a good point that smgrGetPendingDeletes() can be used in the
commit callback for something as simple as a memory table where all the data
is local. It should also work well with a distributed optimistic storage
engine that certifies the transaction at commit time. In that case, the actual
"drop command" is sent out at commit time rather than when the command is
executed.
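
For reference, a minimal sketch of what that could look like, assuming the
access method keeps one shared memory segment per relation and
memam_segment_drop() is a hypothetical lookup-and-release helper:

#include "postgres.h"

#include "access/xact.h"
#include "catalog/storage.h"    /* smgrGetPendingDeletes() */
#include "storage/relfilenode.h"

static void memam_segment_drop(RelFileNode *rnode);     /* hypothetical helper */

static void
memam_commit_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_PRE_COMMIT)
    {
        RelFileNode *rnodes;
        int         nrels = smgrGetPendingDeletes(true, &rnodes);

        /*
         * Release the in-memory segment of every relation whose storage
         * smgr is about to unlink at commit.
         */
        for (int i = 0; i < nrels; i++)
            memam_segment_drop(&rnodes[i]);

        if (rnodes)
            pfree(rnodes);
    }
}

/* Registered once at module load:
 *     RegisterXactCallback(memam_commit_callback, NULL);
 */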

My main interest is, however, an API that works for all kinds of storage
engines: not limited to local storage, but also supporting distributed storage
systems and able to interact with existing implementations. There are a few
reasons why getting a notification when the table is dropped, rather than when
the commit happens, is beneficial.

1. In a distributed storage engine you might want to distribute changes
speculatively as they happen, so that the commit, once it occurs, is fast.
Sending out the action early allows work to start independently of the current
machine, which improves parallelization (see the sketch after this list).
2. In a distributed storage engine, the order in which statements are received
remotely makes a difference. For example, if you want to use a distributed
locking scheme for your distributed storage, you are currently forced to
implement an optimistic scheme, while in reality you might want to distribute
the drop and lock the table exclusively on all remote nodes (this is what
PostgreSQL already does locally, locking the table on a drop). I realize that
distributed transactions are not that simple and that there are other problems
associated with them, but the current behavior still introduces an unnecessary
restriction on what you can do.
3. A general problem with optimistic protocols is that their performance drops
when there are many writes: smaller transactions will constantly force a
long-running transaction to abort. This means there is a risk that a
transaction that drops a table has to be aborted simply because other
transactions are updating the table. In a distributed system there will be
more such transactions, so the odds of aborting "more important" transactions
(in the sense of needing stronger locks) are higher.
4. A smaller issue is that the storage manager (smgr) and the transaction
system are currently quite tightly coupled, which makes it more difficult to
make the storage system pluggable. Not requiring the use of pendingDeletes
would be one step towards removing this coupling, but I am not entirely sure
here.
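
To illustrate point 1 above, a hypothetical distributed AM could broadcast the
drop speculatively from the per-statement callback and only confirm or cancel
it at transaction end (all names below are invented for the example):

#include "postgres.h"

#include "access/xact.h"
#include "utils/rel.h"

/* Hypothetical messaging helpers of a distributed storage engine. */
static void dist_broadcast_prepare_drop(const char *relname);
static void dist_broadcast_confirm_drop(void);
static void dist_broadcast_cancel_drop(void);

/*
 * Hypothetical per-statement drop callback: runs when DROP TABLE is
 * executed, so remote nodes can take their locks and start the work early.
 */
static void
distam_relation_drop(Relation rel)
{
    dist_broadcast_prepare_drop(RelationGetRelationName(rel));
}

static void
distam_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT)
        dist_broadcast_confirm_drop();  /* make the speculative drop final */
    else if (event == XACT_EVENT_ABORT)
        dist_broadcast_cancel_drop();   /* undo the speculative drop */
}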

It is likely that many of these problems can be worked around by restricting
how DDL can be used in transactions, but that would create unnecessary
restrictions for the end user. It might also be possible to find workarounds
by placing code at strategic points in the implementation, but that approach
is error-prone.

Best wishes,
Mats Kindahl

>
> - Heikki
>
