RE: POC: postgres_fdw insert batching

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Tomas Vondra' <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: 'Tomas Vondra' <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: POC: postgres_fdw insert batching
Date: 2020-11-24 08:45:40
Message-ID: TYAPR01MB2990FB16E86A4E51BB3E700AFEFB0@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
> 1) We're calling it "batch_size" but the API function is named
> postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function
> to postgresGetModifyBatchSize()? That has the advantage it'd work if we
> ever add support for batching to UPDATE/DELETE.

Actually, I was in two minds whether the term batch or bulk is better. Because Oracle uses "bulk insert" and "bulk fetch", like in FETCH cur BULK COLLECT INTO array and FORALL in array INSERT INTO, while JDBC uses batch as in "batch updates" and its API method names (addBatch, executeBatch).

But it seems better or common to use batch according to the etymology and the following Stack Overflow page:

https://english.stackexchange.com/questions/141884/which-is-a-better-and-commonly-used-word-bulk-or-batch

OTOH, as for the name GetModifyBatchSize() you suggest, I think GetInsertBatchSize may be better. That is, this API deals with multiple records in a single INSERT statement. Your GetModifyBatchSize will be reserved for statement batching when libpq has supported batch/pipelining to execute multiple INSERT/UPDATE/DELETE statements, as in the following JDBC batch updates. What do you think?

CODE EXAMPLE 14-1 Creating and executing a batch of insert statements
--------------------------------------------------
Statement stmt = con.createStatement();
stmt.addBatch("INSERT INTO employees VALUES (1000, 'Joe Jones')");
stmt.addBatch("INSERT INTO departments VALUES (260, 'Shoe')");
stmt.addBatch("INSERT INTO emp_dept VALUES (1000, 260)");

// submit a batch of update commands for execution
int[] updateCounts = stmt.executeBatch();
--------------------------------------------------

> 2) Do we have to lookup the batch_size in create_foreign_modify (in
> server/table options)? I'd have expected to look it up while planning
> the modify and then pass it through the list, just like the other
> FdwModifyPrivateIndex stuff. But maybe that's not possible.

Don't worry, create_foreign_modify() is called from PlanForeignModify() during planning. Unfortunately, it's also called from BeginForeignInsert(), but other stuff passed to create_foreign_modify() including the query string is constructed there.

> 3) That reminds me - should we show the batching info on EXPLAIN? That
> seems like a fairly interesting thing to show to the user. Perhaps
> showing the average batch size would also be useful? Or maybe not, we
> create the batches as large as possible, with the last one smaller.

Hmm, maybe batch_size is not for EXPLAIN because its value doesn't change dynamically based on the planning or system state unlike shared buffers and parallel workers. OTOH, I sometimes want to see what configuration parameter values the user set, such as work_mem, enable_*, and shared_buffers, together with the query plan (EXPLAIN and auto_explain). For example, it'd be nice if EXPLAIN (parameters on) could do that. Some relevant FDW-related parameters could be included in that output.

> 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and
> over for every tuple. I don't know it that has measurable impact, but it
> seems a bit excessive IMO. I don't think we should support the batch
> size changing during execution (seems tricky).

Don't worry about this, too. GetMaxBulkInsertTuples() just returns a value that was already saved in a struct in create_foreign_modify().

Regards
Takayuki Tsunakawa

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2020-11-24 08:54:39 Re: Deduplicate aggregates and transition functions in planner
Previous Message Thomas Kellerer 2020-11-24 08:22:26 Re: Migration Oracle multitenant database to PostgreSQL ?