RE: POC: postgres_fdw insert batching

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Craig Ringer' <craig(dot)ringer(at)enterprisedb(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: RE: POC: postgres_fdw insert batching
Date: 2020-11-30 09:13:58
Message-ID: TYAPR01MB2990666A2BAC075F3A3FD1CEFEF50@TYAPR01MB2990.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

From: Craig Ringer <craig(dot)ringer(at)enterprisedb(dot)com>
> It was not my intention to hold this patch up or greatly expand its
> scope. I'll spend some time testing it out and have a closer read soon
> to see if I can help progress it.

Thank you, I'm relieved to hear that. Last weekend, I was afraid the mood might turn into something like "We won't accept the insert speedup patch for foreign tables unless you take full advantage of pipelining and achieve the maximum conceivable speed!"

> I thought I gave it at the time, and a demo program. IIRC it was just
> doing small multi row inserts or single row inserts. Latency would've
> been a couple of hundred ms probably, I think I did something like
> running on my laptop (Australia, ADSL) to a server on AWS US or EU.

A couple of hundred ms of latency would dominate each prepare-send-execute-receive cycle, possibly even for a batch insert with hundreds of rows per batch. In that case, the synchronous batch insert in the current patch may achieve a speedup of a few hundred times over single-row inserts when the batch size is in the hundreds or more.
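As a rough sanity check of that claim, here is a back-of-envelope calculation (assumed numbers, not measurements from the patch): if a 200 ms round trip dominates each statement and server-side execution is negligible, the speedup approaches the batch size.

```python
# Back-of-envelope estimate of batching speedup when network round
# trips dominate. The 200 ms round trip and row counts below are
# illustrative assumptions, not measured figures.
round_trip_ms = 200
rows = 10_000
batch_size = 100

single_row_total = rows * round_trip_ms               # one round trip per row
batched_total = (rows // batch_size) * round_trip_ms  # one round trip per batch

speedup = single_row_total / batched_total
print(speedup)  # equals the batch size when latency dominates
```

With these assumptions the speedup is exactly 100x, which matches the intuition that the win scales with batch size as long as per-row server work stays small relative to latency.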

> > I'd like to check other DBMSs and your rich references for the FDW interface.
> (My first intuition is that many major DBMSs might not have client C APIs that
> can be used to implement an async pipelining FDW interface.
>
> Likely correct for C APIs of other traditional DBMSes. I'd be less
> sure about newer non SQL ones, especially cloud oriented. For example
> DynamoDB supports at least async requests in the Java client [3] and
> C++ client [4]; it's not immediately clear if requests can be
> pipelined, but the API suggests they can.

I've checked ODBC, MySQL, Microsoft Synapse Analytics, Redshift, and BigQuery, guessing that data warehouses might have an asynchronous/pipelining API to enable efficient data integration/migration. But none of them has one. (I seem to have spent too long on this and am a bit tired... but it was also a bit fun.) They all support INSERT with multiple records in its VALUES clause, so it will be useful to provide a synchronous batch insert FDW API. I guess Oracle's OCI has an asynchronous API, but I didn't check it.
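For illustration, the multi-row VALUES form that all of the surveyed systems accept can be generated from a batch of rows like this (a hypothetical helper sketched here for clarity, not code from the patch; the `$n` placeholder style is PostgreSQL's):

```python
# Sketch: fold a batch of rows into one multi-row INSERT ... VALUES
# statement with numbered parameter placeholders, similar in spirit to
# what a batching FDW would send to the remote server.
def build_batch_insert(table, columns, batch):
    cols = ", ".join(columns)
    per_row = len(columns)
    values = []
    n = 1
    for _ in batch:
        params = ", ".join(f"${n + i}" for i in range(per_row))
        values.append(f"({params})")
        n += per_row
    return f"INSERT INTO {table} ({cols}) VALUES {', '.join(values)}"

print(build_batch_insert("t", ["a", "b"], [(1, 2), (3, 4)]))
# INSERT INTO t (a, b) VALUES ($1, $2), ($3, $4)
```

The whole batch then costs one round trip instead of one per row, which is where the speedup discussed above comes from.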

As an aside, MySQL 8.0.16 added support for asynchronous execution in its C API, but it allows only one active SQL statement per connection. Likewise, although the ODBC standard defines asynchronous execution (SQLSetStmtAttr(SQL_ASYNC_ENABLE) and SQLCompleteAsync), SQL Server and Synapse Analytics allow only one active statement per connection. psqlODBC doesn't support asynchronous execution at all.

> Most things with a REST-like API can do a fair bit of concurrency
> though. Multiple async nonblocking HTTP connections can be serviced at
> once. Or HTTP/1.1 pipelining can be used [1], or even better HTTP/2.0
> streams [2]. This is relevant for any REST-like API.

I'm not sure if this is related, but Google has deprecated the BigQuery batch HTTP API [1].

[1]
https://cloud.google.com/bigquery/batch

Regards
Takayuki Tsunakawa
