Re: Multi Inserts in CREATE TABLE AS - revived patch

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Paul Guo <guopa(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Multi Inserts in CREATE TABLE AS - revived patch
Date: 2020-11-27 06:51:59
Message-ID: 4eee0730-f6ec-e72d-3477-561643f4b327@swarm64.com
Lists: pgsql-hackers

On 26-11-2020 14:45, Bharath Rupireddy wrote:
> On Thu, Nov 26, 2020 at 5:34 PM Luc Vlaming <luc(at)swarm64(dot)com> wrote:
>>
>> On 26-11-2020 12:36, Bharath Rupireddy wrote:
>>> A few things:
>>>
>>> IIUC Andres mentioned similar kinds of APIs earlier in [1].
>>>
>>> [1] -
>>> https://www.postgresql.org/message-id/20200924024128.kyk3r5g7dnu3fxxx%40alap3.anarazel.de
>>>
>>> I would like to add some more info to one of the APIs:
>>>
>>> typedef struct MultiInsertStateData
>>> {
>>>     MemoryContext micontext;        /* A temporary memory context for
>>>                                      * multi insert. */
>>>     BulkInsertStateData *bistate;   /* Bulk insert state. */
>>>     TupleTableSlot **mislots;       /* Array of buffered slots. */
>>>     uint32  nslots;                 /* Total number of buffered slots. */
>>>     int64   nbytes;                 /* Flush buffers if the total tuple
>>>                                      * size >= nbytes. */
>>>     int32   nused;                  /* Number of currently buffered slots
>>>                                      * for a multi insert batch. */
>>>     int64   nsize;                  /* Total tuple size for a multi insert
>>>                                      * batch. */
>>> } MultiInsertStateData;
>>>
>>> /* Creates a temporary memory context, allocates the MultiInsertStateData
>>>  * and BulkInsertStateData, and initializes the other members. */
>>> void (*begin_multi_insert) (Relation rel, MultiInsertStateData **mistate,
>>>                             uint32 nslots, uint64 nbytes);
>>>
>>> /* Buffers the input slot into the mistate slots, computes the size of the
>>>  * tuple and adds it to the total buffered tuple size; if this size crosses
>>>  * mistate->nbytes, the buffered tuples are flushed into the table. For
>>>  * heapam, the existing heap_multi_insert can be used. Once the buffer is
>>>  * flushed, the micontext can be reset and the buffered slots cleared. If
>>>  * nbytes, i.e. the total tuple size of the batch, is not given, tuple
>>>  * sizes are not calculated and tuples are buffered until all nslots are
>>>  * filled, and only then flushed. */
>>> void (*do_multi_insert) (Relation rel, MultiInsertStateData *mistate,
>>>                          struct TupleTableSlot *slot, CommandId cid,
>>>                          int options);
>>>
>>> /* Flushes the buffered tuples, if any. For heapam, the existing
>>>  * heap_multi_insert can be used. Deletes the temporary memory context
>>>  * and deallocates mistate. */
>>> void (*end_multi_insert) (Relation rel, MultiInsertStateData *mistate,
>>>                           CommandId cid, int options);
>>
>> Looks all good to me, except for the nbytes part.
>> Could you explain to me what use case that supports? IMHO the tableam
>> can best decide itself that it's time to flush, based on its
>> implementation, e.g. considering how many pages to flush at a time,
>> etc. This also means that most of the fields of MultiInsertStateData
>> can be private, as each tableam would return a derivative of that
>> struct (like with the destreceivers).
>>
>
> nbytes basically supports the following case: say the number of
> tuples to buffer is 1000, and all of the tuples are toasted, each a few
> hundred MB or even GB in size. Do we really want to wait until 1000
> tuples are buffered, in which case a single query holds 1000 * the
> toasted tuple size in memory? Having a memory limit gives us
> flexibility; whether to use it or not is up to the table AM
> implementation. The existing copy code (since it knows the tuple size
> after parsing the input data) already uses this mechanism to decide
> when to flush.
>
> If nbytes is not used in a table AM, then the multi insert can wait
> until the total number of tuples, however much memory they occupy, are
> buffered.
>
> IMO, we can retain nbytes for now to decide on when to flush. Thoughts?

I'm very sorry, I had not realized at all that the toasted data would be
kept in memory until written out; I guess I'm not familiar enough with
that part yet. I had assumed this would be toasted beforehand and be
tableam agnostic, and that any decision from the tableam to flush would
happen well before a lot of memory had accumulated, which is a bit naive
in hindsight.
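
Just to check my understanding of the proposed API: the flush decision
inside do_multi_insert() would then look roughly like the sketch below.
The field names come from the proposal above; GetTupleSize() and
heap_multi_insert_flush() are made-up helpers here.

/* Sketch only: buffer a slot and decide when to flush, based on both the
 * slot count and the optional memory limit (nbytes). */
static void
heap_do_multi_insert(Relation rel, MultiInsertStateData *mistate,
                     TupleTableSlot *slot, CommandId cid, int options)
{
    TupleTableSlot *batchslot = mistate->mislots[mistate->nused];

    /* Copy the slot so the tuple data survives until the flush. */
    ExecCopySlot(batchslot, slot);

    mistate->nsize += GetTupleSize(batchslot);      /* made-up helper */
    mistate->nused++;

    /* Flush when the slot array is full, or when the buffered tuples
     * exceed the memory limit, if one was given. */
    if (mistate->nused == mistate->nslots ||
        (mistate->nbytes > 0 && mistate->nsize >= mistate->nbytes))
        heap_multi_insert_flush(rel, mistate, cid, options);    /* made-up */
}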

>
> I wonder, how can the do_multi_insert() API decide when to flush, I
> mean, based on the number of pages to flush? Do we need to pass the
> maximum number of pages the buffered tuples can occupy, and track the
> pages the currently buffered tuples occupy, to decide when to flush? Or
> is this something that the existing table AM infrastructure already
> supports? If we use the number of pages to decide when to flush, how
> well does that work with parallel inserts?
>

I was assuming each tableam would use its own logic, based on its needs
and the tradeoffs a storage engine might want to provide. That does not
mean it should not take outside parameters into account, like the
aforementioned memory usage.
I think it would imply that each tableam implements its own tracking
mechanism for how much has accumulated and when to flush, because they
might track different statistics. IMHO, given that each tableam would
anyway want to implement its own logic for storing a slot into a page,
the logic for tracking these statistics seemed minor to me. Maybe I
missed some parts that should be extracted out into a generic interface,
however?
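
To illustrate what I mean by a private derivative of the struct (like
with the destreceivers), purely a sketch with made-up names:

/* Sketch only: a heap-private state embedding the generic part as its
 * first member, so the tableam can keep whatever bookkeeping it needs. */
typedef struct HeapMultiInsertState
{
    MultiInsertStateData common;        /* generic, caller-visible part */

    /* heap-specific bookkeeping, invisible to callers */
    int         npages_buffered;        /* estimated pages needed by the
                                         * currently buffered tuples */
    int         npages_flush_at;        /* flush once this many pages have
                                         * accumulated, e.g. 64 */
} HeapMultiInsertState;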

Some examples of why a tableam could decide on its own when to flush:
- the current heap implementation could accumulate a few pages (say up
to 64) and thereby limit the number of calls to write() and the
accompanying blocks/context switches (see the sketch after this list).
This would also make the writes more sequential with respect to the
processes, which I presume can help with the flushing, similar to how
the sequential scan was optimized to process a consecutive set of
blocks per worker (see table_block_parallelscan_nextpage).
- something like zheap could accumulate data based on the number of
columns, so that a page with column data is completely filled, thereby
limiting the write amplification.
- something that implements an LSM store might accumulate a full
in-memory level before flushing it out.
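
For the first example, the heap-specific flush check could then look
something like this (again only a sketch, reusing the made-up
HeapMultiInsertState from above):

/* Sketch only: flush based on accumulated pages rather than a fixed slot
 * count, while still honouring the optional memory limit. */
static bool
heap_multi_insert_should_flush(HeapMultiInsertState *hstate)
{
    MultiInsertStateData *mistate = &hstate->common;

    /* Always respect the caller-supplied memory limit, if any. */
    if (mistate->nbytes > 0 && mistate->nsize >= mistate->nbytes)
        return true;

    /* Otherwise accumulate until e.g. 64 pages worth of tuples are
     * buffered, so the writes become larger and more sequential. */
    hstate->npages_buffered = (mistate->nsize + BLCKSZ - 1) / BLCKSZ;

    return hstate->npages_buffered >= hstate->npages_flush_at ||
           mistate->nused == mistate->nslots;
}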

>>
>> One thing I'm wondering is in which memory context the slots end up
>> being allocated. I'd assume we would want to keep the slots around
>> between flushes. If they are in the temporary context this might prove
>> problematic however?
>>
>
> I should not have used the word temporary, it actually is not
> temporary. This memory context will be created in begin_multi_insert(),
> all the buffered tuples are copied using this context, and it will be
> reset at the end of each flush and reused. It can get destroyed at the
> end in end_multi_insert(). I think we should do this in the new API
> implementation as well.
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
>

Okay. But in which context are the slots themselves allocated then?
Because if we allocate the slots themselves in that context and then
reset it with each flush, we would also have to re-allocate the slots on
every flush, which seems wasteful to me?
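
In other words, I would have expected something along these lines in
begin_multi_insert(), where the slots live outside the context that gets
reset (just a sketch, the names are illustrative):

/* Sketch only: the slot array and the slots themselves are allocated in
 * the surrounding, longer-lived context; only micontext is reset per
 * flush. */
static void
heap_begin_multi_insert(Relation rel, MultiInsertStateData **mistate,
                        uint32 nslots, uint64 nbytes)
{
    MultiInsertStateData *state = palloc0(sizeof(MultiInsertStateData));

    /* Holds the copied tuple data; reset after every flush. */
    state->micontext = AllocSetContextCreate(CurrentMemoryContext,
                                             "multi insert data",
                                             ALLOCSET_DEFAULT_SIZES);

    /* Slots are created once and reused across flushes, so resetting
     * micontext does not force re-creating them. */
    state->mislots = palloc(nslots * sizeof(TupleTableSlot *));
    for (uint32 i = 0; i < nslots; i++)
        state->mislots[i] = MakeSingleTupleTableSlot(RelationGetDescr(rel),
                                                     &TTSOpsHeapTuple);

    state->nslots = nslots;
    state->nbytes = nbytes;
    state->bistate = GetBulkInsertState();

    *mistate = state;
}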

Regards,
Luc
Swarm64
