RE: POC: postgres_fdw insert batching

From: Tim(dot)Colles(at)ed(dot)ac(dot)uk
To: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
Cc: "'Tomas Vondra'" <tomas(dot)vondra(at)enterprisedb(dot)com>, "'Tomas Vondra'" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: POC: postgres_fdw insert batching
Date: 2020-11-11 16:04:37
Message-ID: alpine.DEB.2.22.394.2011110818570.809469@corona
Lists: pgsql-hackers

On Wed, 11 Nov 2020, tsunakawa(dot)takay(at)fujitsu(dot)com wrote:

> From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
>> I see the patch builds the "bulk" query in execute_foreign_modify. IMO
>> that's something we should do earlier, when we're building the simple
>> query (for 1-row inserts). I'd understand if you were concerned about
>> overhead in case of 1-row inserts, trying to not plan the bulk query
>> until necessary, but I'm not sure this actually helps.
>>
>> Or was the goal to build a query for every possible number of slots? I
>> don't think that's really useful, considering it requires deallocating
>> the old plan, preparing a new one, etc. IMO it should be sufficient to
>> have two queries - one for 1-row inserts, one for the full batch. The
>> last incomplete batch can be inserted using a loop of 1-row queries.
>>
>> That's what my patch was doing, but I'm not insisting on that - it just
>> seems like a better approach to me. So feel free to argue why this is
>> better.
>
> Don't be concerned, the processing is not changed for 1-row inserts: the INSERT query string is built in PlanForeignModify(), and the remote statement is prepared in execute_foreign_modify() during the first call to ExecForeignInsert() and reused for subsequent ExecForeignInsert() calls.
>
> The re-creation of the INSERT query string and its corresponding PREPARE happens when the number of tuples to be inserted differs from the previous call to ExecForeignInsert()/ExecForeignBulkInsert(). That's because we don't know how many tuples will be inserted during planning (PlanForeignModify) or execution (for INSERT ... SELECT, not until the scan ends). For example, if we insert 10,030 rows with a bulk size of 100, the flow is:
>
> PlanForeignModify():
>     build the INSERT query string for 1 row
> ExecForeignBulkInsert(100):
>     drop the INSERT query string and prepared statement for 1 row
>     build the query string and prepare the statement for a 100-row INSERT
>     execute it
> ExecForeignBulkInsert(100):
>     reuse the prepared statement for the 100-row INSERT and execute it
> ...
> ExecForeignBulkInsert(30):
>     drop the 100-row INSERT query string and prepared statement
>     build the query string and prepare the statement for a 30-row INSERT
>     execute it
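
(For illustration only: under the flow above, the statements prepared on
the remote side would presumably look something like the following,
assuming a hypothetical remote table public.t (a int, b text) and a batch
of 3 rather than 100 for brevity. postgres_fdw prepares its statements via
libpq rather than SQL-level PREPARE, but the shape is the same.)

    -- 1-row statement, built in PlanForeignModify() and prepared on the
    -- first ExecForeignInsert() call
    PREPARE ins_1 (int, text) AS
      INSERT INTO public.t (a, b) VALUES ($1, $2);
    EXECUTE ins_1 (1, 'one');

    -- batch statement, prepared when ExecForeignBulkInsert() first runs
    -- with this batch size
    PREPARE ins_3 (int, text, int, text, int, text) AS
      INSERT INTO public.t (a, b) VALUES ($1, $2), ($3, $4), ($5, $6);
    EXECUTE ins_3 (1, 'one', 2, 'two', 3, 'three');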
>
>
>> I think it'd be good to have such a GUC, even if only for testing and
>> development. We should probably have a way to disable the batching,
>> which the GUC could also do, I think. So +1 to have the GUC.
>
> OK, I'll add it. The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local relations in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" or "record" is not used in any GUC name (except for row_security).
>
> The valid range would be between 1 and 1,000 (I said 10,000 previously, but I think that was an overreaction, and I'm a bit worried about unforeseen trouble that too many tuples might cause.) 1 disables the bulk processing and uses the traditional ExecForeignInsert(). The default value is 100 (would 1 be a more sensible default, to avoid surprising users with increased memory usage?)
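
(For illustration only: if the GUC lands under the proposed name, usage
might look like the following; both the name and the range are still
under discussion.)

    SET max_bulk_insert_tuples = 100;  -- batch up to 100 rows per remote INSERT
    SET max_bulk_insert_tuples = 1;    -- disable batching, one row per INSERT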
>
>
>>> 2. Should we accumulate records per relation in ResultRelInfo
>>> instead? That is, when inserting into a partitioned table that has
>>> foreign partitions, delay insertion until a certain number of input
>>> records accumulate, and then insert accumulated records per relation
>>> (e.g., 50 records to relation A, 30 records to relation B, and 20
>>> records to relation C.) If we do that,
>>>
>>
>> I think there's a chunk of text missing here? If we do that, then what?
>
> Sorry, the two bullets below are what follows. Perhaps I should have written ":" instead of ",".
>
>
>> Anyway, I don't see why accumulating the records in ResultRelInfo would
>> be better than what the patch does now. It seems to me fairly
>> specific to FDWs, so keeping it in the FDW state seems appropriate. What
>> would be the advantage of stashing it in ResultRelInfo?
>
> I thought of distributing input records to their corresponding partitions' ResultRelInfos. For example, when an input record for partition 1 comes, store it in the ResultRelInfo for partition 1; when an input record for partition 2 comes, store it in the ResultRelInfo for partition 2. When a ResultRelInfo accumulates some number of rows, insert the accumulated rows into the partition. When the input ends, perform bulk inserts for the ResultRelInfos that still have accumulated rows.
>
>
>
>> I think that's OK for most use cases, and if it's not (e.g. when there's
>> something requiring the exact order of writes) then it's not possible to
>> use batching. That's one of the reasons why I think we should have a GUC
>> to disable the batching.
>
> Agreed.
>
>
>>> * Should the maximum count of accumulated records be applied per
>>> relation or the query? When many foreign partitions belong to a
>>> partitioned table, if the former is chosen, it may use much memory in
>>> total. If the latter is chosen, the records per relation could be
>>> few and thus the benefit of bulk insert could be small.
>>>
>>
>> I think it needs to be applied per relation, because that's the level at
>> which we can do it easily and consistently. The whole point is to send
>> data in sufficiently large chunks to minimize the communication overhead
>> (latency etc.), but if you enforce it "per query" that seems hard.
>>
>> Imagine you're inserting data into a table with many partitions - how do
>> you pick the number of rows to accumulate? The table may have 10 or 1000
>> partitions, we may be inserting into all partitions or just a small
>> subset, not all partitions may be foreign, etc. It seems pretty
>> difficult to pick and enforce a reliable limit at the query level. But
>> maybe I'm missing something and it's easier than I think?
>>
>> Of course, you're entirely correct that enforcing this at the partition
>> level may require a lot of memory. Sadly, I don't see a way around that,
>> except for (a) disabling batching or (b) ordering the data so that we
>> insert into one partition at a time.
>
> OK, I think I'll try doing it that way, after waiting some days for other opinions.
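
(For illustration only: option (b) from the quote above could be
approximated by presorting the input on the partition key, so that for a
range- or list-partitioned target the rows for each partition arrive
consecutively. Table and column names here are hypothetical.)

    -- "parted" is a partitioned table keyed on part_key; "staging" holds the input
    INSERT INTO parted
    SELECT * FROM staging
    ORDER BY part_key;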
>
>
>> Two more comments regarding this:
>>
>> 1) If we want to be more strict about the memory consumption, we should
>> probably set the limit in terms of memory, not number of rows. Currently,
>> the 100 rows may be 10kB or 10MB; there's no way to know. Of course,
>> this is not the only place with this issue.
>>
>> 2) I wonder what the COPY FROM patch [1] does in this regard. I don't
>> have time to check right now, but I suggest we try to do the same thing,
>> if only to be consistent.
>>
>> [1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru
>
> That COPY FROM patch uses the tuple accumulation mechanism for local tables as-is. That is, it accumulates at most 1,000 tuples per partition.
>
> /*
> * No more than this many tuples per CopyMultiInsertBuffer
> *
> * Caution: Don't make this too big, as we could end up with this many
> * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
> * multiInsertBuffers list. Increasing this can cause quadratic growth in
> * memory requirements during copies into partitioned tables with a large
> * number of partitions.
> */
> #define MAX_BUFFERED_TUPLES 1000
>
>
> Regards
> Takayuki Tsunakawa
>
>

Does this patch affect trigger semantics on the base table?

At the moment, when I insert 1000 rows into a postgres_fdw table using a
single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar), I
naively expect a statement-level trigger on the base table to fire once.
But this is not the case: postgres_fdw implements this operation as 1000
separate insert statements on the base table, so the trigger fires 1000
times instead of once. Hence there is no practical distinction between
using a statement-level and a row-level trigger on the base table in this
context.
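
For concreteness, this is the kind of setup I have in mind; the table,
trigger, and function names are made up, and the postgres_fdw server and
foreign table definitions are omitted:

    -- on the remote server: the base table and a statement-level trigger on it
    CREATE TABLE foo_base (a int, b text);
    CREATE TABLE trigger_log (fired_at timestamptz DEFAULT now());

    CREATE FUNCTION log_stmt() RETURNS trigger LANGUAGE plpgsql AS $$
    BEGIN
      INSERT INTO trigger_log DEFAULT VALUES;
      RETURN NULL;
    END;
    $$;

    CREATE TRIGGER foo_stmt_trig
      AFTER INSERT ON foo_base
      FOR EACH STATEMENT EXECUTE FUNCTION log_stmt();

    -- on the local server: fdw_foo is a postgres_fdw foreign table over foo_base.
    -- Today this adds 1000 rows to trigger_log, because postgres_fdw sends 1000
    -- single-row INSERTs; with batches of 100 it would presumably add only 10.
    INSERT INTO fdw_foo SELECT i, i::text FROM generate_series(1, 1000) AS i;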

So would this patch change the behaviour so that only 10 separate insert
statements (each of 100 rows) would be made against the base table?
If so, that's useful, as it would make it possible to improve performance
using statement-level triggers. But it would also result in more obscure
semantics and might break user processes that depend on the existing
behaviour once the patch is applied.

BTW, is this subtlety documented? I haven't found anything, but I'm happy
to be proved wrong.

Tim

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
