Re: should INSERT SELECT use a BulkInsertState?

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Zhihong Yu <zyu(at)yugabyte(dot)com>
Cc: Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Luc Vlaming <luc(at)swarm64(dot)com>, houzj(dot)fnst(at)fujitsu(dot)com, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: should INSERT SELECT use a BulkInsertState?
Date: 2021-09-27 00:08:45
Message-ID: 20210927000845.GB831@telsasoft.com
Lists: pgsql-hackers

On Mon, May 11, 2020 at 03:19:34PM +0900, Michael Paquier wrote:
> On Fri, May 08, 2020 at 02:25:45AM -0500, Justin Pryzby wrote:
> > Seems to me it should, at least conditionally. At least if there's a function
> > scan or a relation or ..
> >
> > I mentioned a bit about our use-case here:
> > https://www.postgresql.org/message-id/20200219173742.GA30939%40telsasoft.com
> > => I'd prefer our loaders to write their own data rather than dirtying large
> > fractions of buffer cache and leaving it around for other backends to clean up.
>
> Does it matter in terms of performance and for which cases does it
> actually matter?

Every 15 minutes we insert tens of thousands of rows, which dirties a large
number of buffers:

postgres=# CREATE EXTENSION pg_buffercache;
postgres=# DROP TABLE tt;
postgres=# CREATE TABLE tt(i int);
postgres=# INSERT INTO tt SELECT generate_series(1,999999);
postgres=# SELECT usagecount, COUNT(1) FROM pg_buffercache WHERE isdirty GROUP BY 1 ORDER BY 1;
 usagecount | count
------------+-------
          1 |     1
          2 |     1
          3 |     2
          4 |     2
          5 |  4436

With this patch it dirties fewer pages and with lower usage count:

 usagecount | count
------------+-------
          1 |  2052
          2 |     1
          3 |     3
          4 |     2
          5 |    10

The goal is to avoid churning the buffer cache by writing through a small ring
buffer. Note that indexes on the target table will still dirty shared buffers
of their own, and a BulkInsertState won't help much with those.

On Thu, Mar 18, 2021 at 08:29:50AM -0700, Zhihong Yu wrote:
> + mtstate->ntuples > bulk_insert_ntuples &&
> + bulk_insert_ntuples >= 0)
>
> I wonder why bulk_insert_ntuples == 0 is included in the above. It
> seems bulk_insert_ntuples having value of 0 should mean not enabling bulk
> insertions.

I think it ought to be possible to enable bulk insertions immediately, which is
what 0 does. -1 is the value defined to mean "do not use bulk insert".
I realize there's no documentation yet.

This patch started out with the goal of using a BulkInsertState for INSERT,
same as for COPY. We use INSERT ON CONFLICT VALUES(),()..., and it'd be nice
if our data loaders could avoid leaving behind dirty buffers.

Simon suggested also using MultiInsertInfo. However, that patch is complicated:
it cannot handle volatile default expressions, and probably many other things
that can appear in a SELECT but are not supported by the miistate. That may be
better handled by a separate patch (in any case by someone else), like this one:
| New Table Access Methods for Multi and Single Inserts
https://commitfest.postgresql.org/31/2871/

I split off the simple part again. If there's no interest in the 0001 patch
alone, then I guess it should be closed in the CF.

However, the "new table AM" patch doesn't address our case, since neither
VALUES nor INSERT SELECT is considered a bulk insert.

--
Justin

Attachment Content-Type Size
v11-0001-ExecInsert-to-use-BulkInsertState.patch text/x-diff 6.8 KB
