Re: New Table Access Methods for Multi and Single Inserts

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Luc Vlaming <luc(at)swarm64(dot)com>
Cc: Justin Pryzby <pryzby(at)telsasoft(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Paul Guo <guopa(at)vmware(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: New Table Access Methods for Multi and Single Inserts
Date: 2021-01-05 10:06:24
Message-ID: CALj2ACVsiAZMsP8p5MPg6SSEtoMFFaiAa6j2AFtEQJDhfbgs3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 4, 2021 at 1:29 PM Luc Vlaming <luc(at)swarm64(dot)com> wrote:
> > table AM patch [2] be reviewed further?
> As to the patches themselves:
>
> I think the API is a huge step forward! I assume that we want to have a
> single-insert API like heap_insert_v2 so that we can encode the
> knowledge that there will just be a single insert coming and likely a
> commit afterwards?
>
> Reason I'm asking is that I quite liked the heap_insert_begin parameter
> is_multi, which could even be turned into a "expected_rowcount" of the
> amount of rows expected to be commited in the transaction (e.g. single,
> several, thousands/stream).
> If we were to make the API based on expected rowcounts, the whole
> heap_insert_v2, heap_insert and heap_multi_insert could be turned into a
> single function heap_insert, as the knowledge about buffering of the
> slots is then already stored in the TableInsertState, creating an API
like:
>
> // expectedRows: -1 = streaming, otherwise expected rowcount.
> TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int
> options, int expectedRows);
> heap_insert(TableInsertState *state, TupleTableSlot *slot);
>
> Do you think that's a good idea?

IIUC, your suggestion is to use expectedRows and move the multi insert
implementation heap_multi_insert_v2 to heap_insert_v2. If that's correct,
so heap_insert_v2 will look something like this:

heap_insert_v2()
{
if (single_insert)
//do single insertion work, the code in existing heap_insert_v2 comes
here
else
//do multi insertion work, the code in existing heap_multi_insert_v2
comes here
}

I don't see any problem in combining single and multi insert APIs into one.
Having said that, will the APIs be cleaner then? Isn't it going to be
confusing if a single heap_insert_v2 API does both the works? With the
existing separate APIs, for single insertion, the sequence of the API can
be like begin, insert_v2, end and for multi inserts it's like begin,
multi_insert_v2, flush, end. I prefer to have a separate multi insert API
so that it will make the code look readable.

Thoughts?

> Two smaller things I'm wondering:
> - the clear_mi_slots; why is this not in the HeapMultiInsertState? the
> slots themselves are declared there?

Firstly, we need to have the buffered slots sometimes(please have a look at
the comments in TableInsertState structure) outside the multi_insert API.
And we need to have cleared the previously flushed slots before we start
buffering in heap_multi_insert_v2(). I can remove the clear_mi_slots flag
altogether and do as follows: I will not set mistate->cur_slots to 0 in
heap_multi_insert_flush after the flush, I will only set state->flushed to
true. In heap_multi_insert_v2,

void
heap_multi_insert_v2(TableInsertState *state, TupleTableSlot *slot)
{
TupleTableSlot *batchslot;
HeapMultiInsertState *mistate = (HeapMultiInsertState *)state->mistate;
Size sz;

Assert(mistate && mistate->slots);

* /* if the slots are flushed previously then clear them off before using
them again. */ if (state->flushed) { int i; for (i = 0;
i < mistate->cur_slots; i++) ExecClearTuple(mistate->slots[i]);
mistate->cur_slots = 0; state->flushed = false }*

if (mistate->slots[mistate->cur_slots] == NULL)
mistate->slots[mistate->cur_slots] =
table_slot_create(state->rel, NULL);

batchslot = mistate->slots[mistate->cur_slots];

ExecCopySlot(batchslot, slot);

Thoughts?

> Also, why do we want to do ExecClearTuple() anyway? Isn't
> it good enough that the next call to ExecCopySlot will effectively clear
> it out?

For virtual, heap, minimal tuple slots, yes ExecCopySlot slot clears the
slot before copying. But, for buffer heap slots, the
tts_buffer_heap_copyslot does not always clear the destination slot, see
below. If we fall into else condition, we might get some issues. And also
note that, once the slot is cleared in ExecClearTuple, it will not be
cleared again in ExecCopySlot because TTS_SHOULDFREE(slot) will be false.
That is why, let's have ExecClearTuple as is.

/*
* If the source slot is of a different kind, or is a buffer slot that
has
* been materialized / is virtual, make a new copy of the tuple.
Otherwise
* make a new reference to the in-buffer tuple.
*/
if (dstslot->tts_ops != srcslot->tts_ops ||
TTS_SHOULDFREE(srcslot) ||
!bsrcslot->base.tuple)
{
MemoryContext oldContext;

ExecClearTuple(dstslot);
}
else
{
Assert(BufferIsValid(bsrcslot->buffer));

tts_buffer_heap_store_tuple(dstslot, bsrcslot->base.tuple,
bsrcslot->buffer, false);

> - flushed -> why is this a stored boolean? isn't this indirectly encoded
> by cur_slots/cur_size == 0?

Note that cur_slots is in HeapMultiInsertState and outside of the new APIs
i.e. in TableInsertState, mistate is a void pointer, and we can't really
access the cur_slots. I mean, we can access but we need to be dereferencing
using the tableam kind. Instead of doing all of that, to keep the API
cleaner, I chose to have a boolean in the TableInsertState which we can see
and use outside of the new APIs. Hope that's fine.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-01-05 10:32:04 Re: Parallel Inserts in CREATE TABLE AS
Previous Message Peter Smith 2021-01-05 10:02:25 Re: Single transaction in the tablesync worker?