Re: POC: postgres_fdw insert batching

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: POC: postgres_fdw insert batching
Date: 2020-07-01 18:34:33
Message-ID: 20200701183433.nw45gul5o4ogyl6t@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-06-28 17:10:02 +0200, Tomas Vondra wrote:
> 3) Should we do batching for COPY instead?
>
> While looking at multi_insert, I've realized it's mostly exactly what
> the new "batching insert" API function would need to be. But it's only
> really used in COPY, so I wonder if we should just abandon the idea of
> batching INSERTs and do batching COPY for FDW tables.
>
> For cases that can replace INSERT with COPY this would be enough, but
> unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do
> this :-(

I personally think - and I realize that might be annoying to somebody
looking to make an incremental improvement - that the nodeModifyTable.c
and copy.c code dealing with DML has become too complicated to add
features like this without a larger refactoring. That complexity leads
to exactly this kind of choice: whether to add a feature in one place
but not the other.

I think before we add more complexity, we ought to centralize and clean
up the DML handling code so most is shared between copy.c and
nodeModifyTable.c. Then we can much more easily add batching to FDWs, to
CTAS, to INSERT INTO SELECT etc, for which there are patches already.

> 4) Expected consistency?
>
> I'm not entirely sure what are the consistency expectations for FDWs.
> Currently the FDW nodes pointing to the same server share a connection,
> so the inserted rows might be visible to other nodes. But if we only
> stash the rows in a local buffer for a while, that's no longer true. So
> maybe this breaks the consistency expectations?

Given that local queries don't provide that visibility either (the
snapshot won't have those changes visible), I think we shouldn't be too
concerned about that. If anything we should be concerned about the
opposite.

If we are concerned, perhaps we could add functionality to flush all
pending changes before executing further statements?

> I was thinking about adding a GUC to enable/disable the batching at some
> level (global, server, table, ...) but it seems like a bad match because
> the consistency expectations likely depend on a query. There should be a
> GUC to set the batch size, though (it's hardcoded to 100 for now).

Hm. If libpq allowed us to utilize pipelining, ISTM the answer here
would be not to batch by building a single statement with all rows as a
VALUES list, but to issue the individual INSERTs in a pipelined manner.
That'd probably remove all behavioural differences. I really wish
somebody would pick up that libpq patch again.

Greetings,

Andres Freund
