Re: New Table Access Methods for Multi and Single Inserts

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: New Table Access Methods for Multi and Single Inserts
Date: 2024-03-23 00:17:12
Message-ID: 6be6f58815dc0844fbe058edf56b4e735a6efc1c.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 2024-03-21 at 13:10 +0530, Bharath Rupireddy wrote:
> I'm attaching the v13 patches using virtual tuple slots for buffered
> tuples for multi inserts.

Comments:

* Do I understand correctly that CMV, RMV, and CTAS experience a
performance benefit, but COPY FROM does not? And is that because COPY
already used table_multi_insert, whereas CMV and RMV did not?

* In the COPY FROM code, it looks like it's deciding whether to flush
based on MAX_BUFFERED_TUPLES, but the slot array is allocated with
MAX_BUFFERED_SLOTS (they happen to be the same for heap, but perhaps
not for other AMs). The copy code shouldn't be using internal knowledge
of the multi-insert code; it should know somehow from the API when the
right time is to flush.

* How is the memory management expected to work? It looks like COPY
FROM is using the ExprContext when running the input functions, but we
really want to be using a memory context owned by the table AM, right?

* What's the point of the table_multi_insert_slots() "num_slots"
argument? The only caller simply discards it.

* table_tuple_insert_v2 isn't called anywhere, what's it for?

* the "v2" naming is inconsistent -- it seems you only added it in
places where there's a name conflict, which makes it hard to tell which
API methods go together. I'm not sure how widely table_multi_insert* is
used outside of core, so it's possible that we may even be able to just
change those APIs and the few extensions that call it can be updated.

* Memory usage tracking should be done in the AM by allocating
everything in a single context so it's easy to check the size. Don't
manually add up memory.

* I don't understand: "Caller may have got the slot using
heap_multi_insert_next_free_slot, filled it and passed. So, skip
copying in such a case." If the COPY FROM had a WHERE clause and
skipped a tuple after filling the slot, doesn't that mean the slot has
bogus data from the last tuple?

* We'd like this to work for insert-into-select (IIS) and logical
replication, too. Do you see any problem there, or is it just a matter
of code?

* Andres had some comments[1] that don't seem entirely addressed.
- You are still allocating the AM-specific part of TableModifyState
as a separately-allocated chunk.
- It's still called TableInsertState rather than TableModifyState as
he suggested. If you change that, you should also change to
table_modify_begin/end.
- CID: I suppose Andres is considering the use case of "BEGIN; ...
ten thousand inserts ... COMMIT;". I don't think this problem is really
solvable (discussed below) but we should have some response/consensus
on that point.
- He mentioned that we only need one new method implemented by the
AM. I don't know if one is enough, but 7 does seem excessive. I have
some simplification ideas below.

Overall:

If I understand correctly, there are two ways to use the API:

1. used by CTAS, MV:

tistate = table_insert_begin(...);
table_multi_insert_v2(tistate, tup1);
...
table_multi_insert_v2(tistate, tupN);
table_insert_end(tistate);

2. used by COPY ... FROM:

tistate = table_insert_begin(..., SKIP_FLUSH);
if (multi_insert_slot_array_is_full())
table_multi_insert_flush(tistate);
slot = table_insert_next_free_slot(tistate);
... fill slot with tup1
table_multi_insert_v2(tistate, tup1);
...
slot = table_insert_next_free_slot(tistate);
... fill slot with tupN
table_multi_insert_v2(tistate, tupN);
table_insert_end(tistate);

Those two uses need comments explaining what's going on. It appears the
SKIP_FLUSH flag is used to indicate which use the caller intends.

Use #2 is not enforced well by either the API or runtime checks. If the
caller neglects to check for a full buffer, it appears that it will
just overrun the slots array.

Also, for use #2, table_multi_insert_v2() doesn't do much other than
incrementing the memory used. The slot will never be NULL because it
was obtained with table_multi_insert_next_free_slot(), and the other
two branches don't happen when SKIP_FLUSH is true.

The real benefit to COPY of your new API is that the AM can manage
slots for itself, and how many tuples may be tracked (which might be a
lot higher for non-heap AMs).

I agree with Luc Vlaming's comment[2] that more should be left to the
table AM. Your patch tries too hard to work with the copyfrom.c slot
array, somehow sharing it with the table AM. That adds complexity to
the API and feels like a layering violation.

We also shouldn't mandate a slot array in the API. Each slot is 64
bytes -- a lot of overhead for small tuples. For a non-heap AM, it's
much better to store the tuple data in a big contiguous chunk with
minimal overhead.

Let's just have a simple API like:

tmstate = table_modify_begin(...);
table_modify_save_insert(tmstate, tup1);
...
table_modify_save_insert(tmstate, tupN);
table_modify_end(tmstate);

and leave it up to the AM to do all the buffering and flushing work (as
Luc Vlaming suggested[2]).

That leaves one problem, which is: how do we update the indexes and
call AR triggers while flushing? I think the best way is to just have a
callback in the TableModifyState that is called during flush. (I don't
think that would affect performance, but worth double-checking.)

We have to disable this whole multi-insert mechanism if there are
volatile BR/AR triggers, because those are supposed to see already-
inserted tuples. That's not a problem with your patch but it is a bit
unfortunate -- triggers can be costly already, but this increases the
penalty. There may be some theoretical ways to avoid this problem, like
reading tuples out of the unflushed buffer during a SELECT, which
sounds a little too clever (though perhaps not completely crazy if the
AM is in control of both?).

For potentially working with multi-updates/deletes, it might be as
simple as tracking the old TIDs along with the slots and having new
_save_update and _save_delete methods. I haven't thought deeply about
that, and I'm not sure we have a good example AM to work with, but it
seems plausible that we could make something useful here.

To batch multiple different INSERT statements within a transaction just
seems like a really hard problem. That could mean different CIDs, but
also different subtransaction IDs. Constraint violation errors will
happen at the time of flushing, which could be many commands later from
the one that actually violates the constraint. And what if someone
issues a SELECT in the middle of the transaction, how does it see the
already-inserted-but-not-flushed tuples? If that's not hard enough
already, then you would also need to extend low-level APIs to accept
arbitrary CIDs and subxact IDs when storing tuples during a flush. The
only way I could imagine solving all of these problems is declaring
somehow that your transaction won't do any of these complicated things,
and that you don't mind getting constraint violations at the wrong
time. So I recommend that you punt on this problem.

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/20230603223824.o7iyochli2dwwi7k%40alap3.anarazel.de
[2]
https://www.postgresql.org/message-id/508af801-6356-d36b-1867-011ac6df8f55%40swarm64.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2024-03-23 00:22:11 Re: BitmapHeapScan streaming read user and prelim refactoring
Previous Message Andres Freund 2024-03-23 00:03:36 Re: Trying to build x86 version on windows using meson