RE: POC: postgres_fdw insert batching

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Tomas Vondra' <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: 'Tomas Vondra' <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: POC: postgres_fdw insert batching
Date: 2020-11-11 07:20:15
Message-ID: TYAPR01MB299099C34392F2A9FA0777BDFEE80@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
> I see the patch builds the "bulk" query in execute_foreign_modify. IMO
> that's something we should do earlier, when we're building the simple
> query (for 1-row inserts). I'd understand if you were concerned about
> overhead in case of 1-row inserts, trying to not plan the bulk query
> until necessary, but I'm not sure this actually helps.
>
> Or was the goal to build a query for every possible number of slots? I
> don't think that's really useful, considering it requires deallocating
> the old plan, preparing a new one, etc. IMO it should be sufficient to
> have two queries - one for 1-row inserts, one for the full batch. The
> last incomplete batch can be inserted using a loop of 1-row queries.
>
> That's what my patch was doing, but I'm not insisting on that - it just
> seems like a better approach to me. So feel free to argue why this is
> better.

Don't be concerned, the processing is not changed for 1-row inserts: the INSERT query string is built in PlanForeignModify(), and the remote statement is prepared in execute_foreign_modify() during the first call to ExecForeignInsert() and it's reused for subsequent ExecForeignInsert() calls.

The re-creation of INSERT query string and its corresponding PREPARE happen when the number of tuples to be inserted is different from the previous call to ExecForeignInsert()/ExecForeignBulkInsert(). That's because we don't know how many tuples will be inserted during planning (PlanForeignModify) or execution (until the scan ends for SELECT). For example, if we insert 10,030 rows with the bulk size 100, the flow is:

PlanForeignModify():
build the INSERT query string for 1 row
ExecForeignBulkInsert(100):
drop the INSERT query string and prepared statement for 1 row
build the query string and prepare statement for 100 row INSERT
execute it
ExecForeignBulkInsert(100):
reuse the prepared statement for 100 row INSERT and execute it
...
ExecForeignBulkInsert(30):
drop the INSERT query string and prepared statement for 100 row
build the query string and prepare statement for 30 row INSERT
execute it

> I think it'd be good to have such GUC, even if only for testing and
> development. We should probably have a way to disable the batching,
> which the GUC could also do, I think. So +1 to have the GUC.

OK, I'll add it. The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local relations in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" or "record" is not used in GUC (except for row_security).

The valid range would be between 1 and 1,000 (I said 10,000 previously, but I think it's overreaction and am a bit worried about unforseen trouble too many tuples might cause.) 1 disables the bulk processing and uses the traditonal ExecForeignInsert(). The default value is 100 (would 1 be sensible as a default value to avoid surprising users by increased memory usage?)

> > 2. Should we accumulate records per relation in ResultRelInfo
> > instead? That is, when inserting into a partitioned table that has
> > foreign partitions, delay insertion until a certain number of input
> > records accumulate, and then insert accumulated records per relation
> > (e.g., 50 records to relation A, 30 records to relation B, and 20
> > records to relation C.) If we do that,
> >
>
> I think there's a chunk of text missing here? If we do that, then what?

Sorry, the two bullets below there are what follows. Perhaps I should have written ":" instead of ",".

> Anyway, I don't see why accumulating the records in ResultRelInfo would
> be better than what the patch does now. It seems to me like fairly
> specific to FDWs, so keeping it int FDW state seems appropriate. What
> would be the advantage of stashing it in ResultRelInfo?

I thought of distributing input records to their corresponding partitions' ResultRelInfos. For example, input record for partition 1 comes, store it in the ResultRelInfo for partition 1, then input record for partition 2 comes, store it in the ResultRelInfo for partition 2. When a ResultRelInfo accumulates some number of rows, insert the accumulated rows therein into the partition. When the input endds, perform bulk inserts for ResultRelInfos that have accumulated rows.

> I think that's OK for most use cases, and if it's not (e.g. when there's
> something requiring the exact order of writes) then it's not possible to
> use batching. That's one of the reasons why I think we should have a GUC
> to disable the batching.

Agreed.

> > * Should the maximum count of accumulated records be applied per
> > relation or the query? When many foreign partitions belong to a
> > partitioned table, if the former is chosen, it may use much memory in
> > total. If the latter is chosen, the records per relation could be
> > few and thus the benefit of bulk insert could be small.
> >
>
> I think it needs to be applied per relation, because that's the level at
> which we can do it easily and consistently. The whole point is to send
> data in sufficiently large chunks to minimize the communication overhead
> (latency etc.), but if you enforce it "per query" that seems hard.
>
> Imagine you're inserting data into a table with many partitions - how do
> you pick the number of rows to accumulate? The table may have 10 or 1000
> partitions, we may be inserting into all partitions or just a small
> subset, not all partitions may be foreign, etc. It seems pretty
> difficult to pick and enforce a reliable limit at the query level. But
> maybe I'm missing something and it's easier than I think?
>
> Of course, you're entirely correct enforcing this at the partition level
> may require a lot of memory. Sadly, I don't see a way around that,
> except for (a) disabling batching or (b) ordering the data to insert
> data into one partition at a time.

OK, I think I'll try doing like that, after waiting for other opinions some days.

> Two more comments regarding this:
>
> 1) If we want to be more strict about the memory consumption, we should
> probably set the limit in terms of memory, not number of rows. Currently
> the 100 rows may be 10kB or 10MB, there's no way to know. Of course,
> this is not the only place with this issue.
>
> 2) I wonder what the COPY FROM patch [1] does in this regard. I don't
> have time to check right now, but I suggest we try to do the same thing,
> if only to be consistent.
>
> [1]
> https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-909
> 86e55489f%40postgrespro.ru

That COPY FROM patch uses the tuple accumulation mechanism for local tables as-is. That is, it accumulates at most 1,000 tuples per partition.

/*
* No more than this many tuples per CopyMultiInsertBuffer
*
* Caution: Don't make this too big, as we could end up with this many
* CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
* multiInsertBuffers list. Increasing this can cause quadratic growth in
* memory requirements during copies into partitioned tables with a large
* number of partitions.
*/
#define MAX_BUFFERED_TUPLES 1000

Regards
Takayuki Tsunakawa

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2020-11-11 08:04:05 Re: Skip ExecCheckRTPerms in CTAS with no data
Previous Message Pavel Stehule 2020-11-11 07:09:40 Re: proposal: possibility to read dumped table's name from file