RE: POC: postgres_fdw insert batching

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Tomas Vondra' <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: 'Tomas Vondra' <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: POC: postgres_fdw insert batching
Date: 2020-11-19 02:43:07
Message-ID: TYAPR01MB2990A726F963C10DF3B98790FEE00@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
> Unfortunately, this does not compile for me, because nodeModifyTable calls
> ExecGetTouchedPartitions, which is not defined anywhere. Not sure what that's
> about, so I simply commented it out. That probably breaks the partitioned
> cases, but it allowed me to do some review and testing.

Ouch, sorry. I'm ashamed to have forgotten to include execPartition.c.

> There are a couple of other smaller changes. E.g. it undoes the changes to
> finish_foreign_modify, and instead calls separate functions to prepare the bulk
> statement. It also adds list_make5/list_make6 macros, so as to not have to do
> strange stuff with the parameter lists.

Thanks, I'll gladly take them! I wonder why I didn't think of separating deallocate_query() from finish_foreign_modify() ... perhaps my brain was dying. As for list_make5/6(), I saw that your first patch avoided adding them, so I thought you found them ugly (and I felt so, too). But thinking about it now, there's no reason to hesitate.
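(For the archives: I assume the new macros simply follow the existing list_make4 pattern in pg_list.h and list.c. An untested sketch of what list_make5 would presumably look like:

/* in pg_list.h, next to list_make4 */
#define list_make5(x1,x2,x3,x4,x5) \
	list_make5_impl(T_List, list_make_ptr_cell(x1), list_make_ptr_cell(x2), \
					list_make_ptr_cell(x3), list_make_ptr_cell(x4), \
					list_make_ptr_cell(x5))

/* in list.c, next to list_make4_impl */
List *
list_make5_impl(NodeTag t, ListCell datum1, ListCell datum2,
				ListCell datum3, ListCell datum4, ListCell datum5)
{
	/* allocate a 5-element list and fill the cells directly */
	List	   *list = new_list(t, 5);

	list->elements[0] = datum1;
	list->elements[1] = datum2;
	list->elements[2] = datum3;
	list->elements[3] = datum4;
	list->elements[4] = datum5;
	check_list_invariants(list);
	return list;
}

list_make6 would be the obvious extension of the same pattern.)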

> A finally, this should probably add a bunch of regression tests.

Sure.

> 1) As I mentioned before, I really don't think we should be doing deparsing in
> execute_foreign_modify - that's something that should happen earlier, and
> should be in a deparse.c function.
...
> The attached patch tries to address both of these points.
>
> Firstly, it adds a new deparseBulkInsertSql function, that builds a query for the
> "full" batch, and then uses those two queries - when we get a full batch we use
> the bulk query, otherwise we use the single-row query in a loop. IMO this is
> cleaner than deparsing queries ad hoc in the execute_foreign_modify.
...
> Of course, this might be worse when we don't have a full batch, e.g. for a query
> that inserts only 50 rows with batch_size=100. If this case is common, one
> option would be lowering the batch_size accordingly. If we really want to
> improve this case too, I suggest we pass more info than just a position of the
> VALUES clause - that seems a bit too hackish.
...
> Secondly, it adds the batch_size option to server/foreign table, and uses that.
> This is not complete, though. postgresPlanForeignModify passes a hard-coded
> value at the moment; it needs to look up the correct value for the
> server/table from RelOptInfo or something. And I suppose the ModifyTable
> infrastructure will need to determine the value in order to pass the correct
> number of slots to the FDW API.

I can sort of understand your feeling, but I'd like to reconstruct the query and prepare it in execute_foreign_modify() because:

* Some of our customers use bulk insert in ECPG (INSERT ... VALUES (record1), (record2), ...) to insert a variable number of records per query. (Oracle's Pro*C has such a feature.) So, I want to be prepared to enable such a thing through the FDW interface.

* The number of records to insert is not known at planning time (in general), so it feels natural, or at least not unnatural, to prepare the query during the execution phase.

* I wanted to avoid the overhead of building the full query string for a 100-record INSERT statement during query planning, because it may be a bit costly for the usual 1-record inserts. (The overhead may be hidden behind postgres_fdw's high communication cost, though.)

So, in terms of code cleanliness, how about moving my code for rebuilding the query string from execute_foreign_modify() to a new function in deparse.c?
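To be concrete, I'm imagining a helper along these lines (a rough, untested sketch; the function name and signature are mine, not from either patch). It would take the already-deparsed single-row INSERT, the offset just past the first VALUES row, and append the remaining parameter rows:

static char *
rebuild_insert_sql(const char *orig_query, int values_end_pos,
				   int num_params, int num_rows)
{
	StringInfoData buf;
	int			pindex = num_params + 1;	/* next parameter number */
	int			i,
				j;

	initStringInfo(&buf);

	/* copy everything up to and including the first "(...)" row */
	appendBinaryStringInfo(&buf, orig_query, values_end_pos);

	/* append the remaining rows: ", ($n, $n+1, ...)" */
	for (i = 1; i < num_rows; i++)
	{
		appendStringInfoString(&buf, ", (");
		for (j = 0; j < num_params; j++)
		{
			if (j > 0)
				appendStringInfoString(&buf, ", ");
			appendStringInfo(&buf, "$%d", pindex++);
		}
		appendStringInfoChar(&buf, ')');
	}

	/* copy whatever follows the VALUES list, e.g. a RETURNING clause */
	appendStringInfoString(&buf, orig_query + values_end_pos);

	return buf.data;
}

execute_foreign_modify() would call it (and re-prepare) only when the number of tuples in the batch actually changes, so the common 1-row case should pay nothing extra.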

> 2) I think the GUC should be replaced with a server/table option, similar to
> fetch_size.

Hmm, batch_size differs from fetch_size. fetch_size is a postgres_fdw-specific feature with no relevant FDW routine, while batch_size is a configuration parameter for all FDWs that implement ExecForeignBulkInsert(). The ideas I can think of are:

1. Follow JDBC/ODBC and add standard FDW properties. For example, the JDBC standard defines standard connection pool properties such as maxPoolSize and minPoolSize, and JDBC drivers have to provide them under those defined names. Likewise, the FDW interface would require FDW implementors to handle the foreign server option named "max_bulk_insert_tuples" if they want to provide the bulk insert feature and implement ExecForeignBulkInsert(). The core executor would get that setting from the FDW by calling a new FDW routine like GetMaxBulkInsertTuples() (a rough sketch follows this list). Sigh...

2. Add a new max_bulk_insert_tuples reloption to CREATE/ALTER FOREIGN TABLE. The executor gets the value from the Relation and uses it. (But is this a table-specific configuration? I don't think so, sigh...)

3. Adopt the current USERSET GUC max_bulk_insert_tuples. I think this is enough because the user can change the setting per session, application, and database.
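For idea 1, the new routine could look roughly like this (entirely hypothetical: neither the routine name, the option name, nor the callback exists today, and option validation is omitted):

/* new FdwRoutine callback the core executor would consult (hypothetical) */
typedef int (*GetMaxBulkInsertTuples_function) (Relation rel);

/* postgres_fdw's implementation: read the option from server/table options */
static int
postgresGetMaxBulkInsertTuples(Relation rel)
{
	ForeignTable *table = GetForeignTable(RelationGetRelid(rel));
	ForeignServer *server = GetForeignServer(table->serverid);
	int			batch_size = 1;		/* default: no batching */
	ListCell   *lc;

	/* scan server options first so that a table-level setting overrides them */
	foreach(lc, list_concat(list_copy(server->options), table->options))
	{
		DefElem    *def = (DefElem *) lfirst(lc);

		if (strcmp(def->defname, "max_bulk_insert_tuples") == 0)
			batch_size = atoi(defGetString(def));
	}

	return batch_size;
}

The executor would call it once per target relation and use the result to size the array of slots it hands to ExecForeignBulkInsert().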

Regards
Takayuki Tsunakawa
