Re: New Table Access Methods for Multi and Single Inserts

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Paul Guo <guopa(at)vmware(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: New Table Access Methods for Multi and Single Inserts
Date: 2021-01-04 07:59:01
Message-ID: 96eaa813-4ad6-e80a-04a4-cc8082d356dd@swarm64.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 28-12-2020 13:48, Bharath Rupireddy wrote:
> On Fri, Dec 25, 2020 at 8:10 AM Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>> On Thu, Dec 24, 2020 at 05:48:42AM +0530, Bharath Rupireddy wrote:
>>> I'm not posting the updated 0002 to 0004 patches, I plan to do so
>>> after a couple of reviews happen on the design of the APIs in 0001.
>>>
>>> Thoughts?
>>
>> Are you familiar with this work ?
>>
>> https://commitfest.postgresql.org/31/2717/
>> Reloptions for table access methods
>>
>> It seems like that can be relevant for your patch, and I think some of what
>> your patch needs might be provided by AM opts.
>>
>> It's difficult to generalize AMs when we have only one, but your use-case might
>> be a concrete example which would help to answer some questions on the other
>> thread.
>>
>> @Jeff: https://commitfest.postgresql.org/31/2871/
>
> Note that I have not gone through the entire thread at [1]. On some
> initial study, that patch is proposing to allow different table AMs to
> have custom rel options.
>
> In the v2 patch that I sent upthread [2] for new table AMs has heap AM
> multi insert code moved inside the new heap AM implementation and I
> don't see any need of having rel options. In case, any other AMs want
> to have the control for their multi insert API implementation via rel
> options, I think the proposal at [1] can be useful.
>
>
> Thoughts?
>
> [1] - https://commitfest.postgresql.org/31/2717/
> [2] - https://www.postgresql.org/message-id/CALj2ACWMnZZCu%3DG0PJkEeYYicKeuJ-X%3DSU19i6vQ1%2B%3DuXz8u0Q%40mail.gmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>
Hi,

> IIUC, there's no dependency or anything as such for the new table AM
> patch with the rel options thread [1]. If I'm right, can this new
> table AM patch [2] be reviewed further?

To me this seems good enough. Reason is that I anticipate that there
would not necessarily be per-table options for now but rather global
options, if any. Moreover, if we want to make these kind of tradeoffs
user-controllable I would argue this should be done in a different
patch-set either way. Reason is that there are parameters in heap
already that are computed / hardcoded as well (see e.g.
RelationAddExtraBlocks).

===

As to the patches themselves:

I think the API is a huge step forward! I assume that we want to have a
single-insert API like heap_insert_v2 so that we can encode the
knowledge that there will just be a single insert coming and likely a
commit afterwards?

Reason I'm asking is that I quite liked the heap_insert_begin parameter
is_multi, which could even be turned into a "expected_rowcount" of the
amount of rows expected to be commited in the transaction (e.g. single,
several, thousands/stream).
If we were to make the API based on expected rowcounts, the whole
heap_insert_v2, heap_insert and heap_multi_insert could be turned into a
single function heap_insert, as the knowledge about buffering of the
slots is then already stored in the TableInsertState, creating an API like:

// expectedRows: -1 = streaming, otherwise expected rowcount.
TableInsertState* heap_insert_begin(Relation rel, CommandId cid, int
options, int expectedRows);
heap_insert(TableInsertState *state, TupleTableSlot *slot);

Do you think that's a good idea?

Two smaller things I'm wondering:
- the clear_mi_slots; why is this not in the HeapMultiInsertState? the
slots themselves are declared there? also, the boolean themselves is
somewhat problematic I think because it would only work if you specified
is_multi=true which would depend on the actual tableam to implement this
then in a way that copy/ctas/etc can also use the slot properly, which I
think would severely limit their freedom to store the slots more
efficiently? Also, why do we want to do ExecClearTuple() anyway? Isn't
it good enough that the next call to ExecCopySlot will effectively clear
it out?
- flushed -> why is this a stored boolean? isn't this indirectly encoded
by cur_slots/cur_size == 0?

For patches 02-04 I quickly skimmed through them as I assume we first
want the API agreed upon. Generally they look nice and like a big step
forward. What I'm just wondering about is the usage of the
implementation details like mistate->slots[X]. It makes a lot of sense
to do so but also makes for a difficult compromise, because now the
tableam has to guarantee a copy of the slot, and hopefully even one in a
somewhat efficient form.

Kind regards,
Luc

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Luc Vlaming 2021-01-04 08:04:16 Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW
Previous Message Justin Pryzby 2021-01-04 07:06:00 Re: zstd compression for pg_dump