Re: should INSERT SELECT use a BulkInsertState?

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: Luc Vlaming <luc(at)swarm64(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: should INSERT SELECT use a BulkInsertState?
Date: 2020-11-02 18:45:51
Message-ID: 20201102184551.GJ22691@telsasoft.com
Lists: pgsql-hackers

On Mon, Nov 02, 2020 at 07:53:45AM +0100, Luc Vlaming wrote:
> On 30.10.20 05:51, Justin Pryzby wrote:
> > On Thu, Oct 22, 2020 at 01:29:53PM +0100, Simon Riggs wrote:
> > > On Fri, 16 Oct 2020 at 22:05, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
> > >
> > > > > > I made this conditional on BEGIN BULK/SET bulk, so I'll solicit comments on that.
> > >
> > > I think it would be better if this was self-tuning. So that we don't
> > > allocate a bulkinsert state until we've done say 100 (?) rows
> > > inserted.
> >
> > I made it an optional, non-default behavior in response to the legitimate
> > concern about a performance regression in cases where a loader needs to be
> > as fast as possible - as opposed to our case, where we instead want to
> > optimize for our reports by making the loaders responsible for their own
> > writes, rather than leaving behind many dirty pages and clobbering the
> > cache, too.
> >
> > Also, INSERT SELECT doesn't immediately help us (telsasoft), since we use
> > INSERT .. VALUES () .. ON CONFLICT. This would handle that case, which is
> > great, even though that wasn't a design goal. It could also be an integer GUC
> > to allow configuring the size of the ring buffer.
> >
> > > You should also use table_multi_insert() since that will give further
> > > performance gains by reducing block access overheads. Switching from
> > > single row to multi-row should also only happen once we've loaded a
> > > few rows, so we don't introduce overheads for smaller SQL statements.
> >
> > Good idea... multi_insert (which reduces the overhead of individual inserts)
> > is mostly independent of BulkInsertState (which uses a ring buffer to avoid
> > dirtying the cache). I made this 0002.
> >
> > This makes INSERT SELECT several times faster, without clobbering the cache.
> >
> > Time: 4700.606 ms (00:04.701)
> > 123 | 1
> > 37 | 2
> > 20 | 3
> > 11 | 4
> > 4537 | 5
> > 11656 |
> >
> > Time: 1125.302 ms (00:01.125)
> > 2171 | 1
> > 37 | 2
> > 20 | 3
> > 11 | 4
> > 111 | 5
> > 14034 |
> >
> > When enabled, this passes nearly all regression tests, and all but 2 of the
> > changes are easily understood. The 2nd patch still needs work.
> >
>
> Hi,
>
> Came across this thread because I'm working on an improvement to relation
> extension to speed up the bulk insert itself in (highly) parallel cases, and
> would like to make sure that our approaches work nicely together.

Thanks for looking.

Since this is a GUC, I thought it would accommodate users optimizing for either
inserts or selects, as well as users who don't want to change their application
(they can "ALTER SYSTEM SET bulk_insert=on"). I'm not thrilled about adding a
new GUC, but that seems to be required for "BEGIN BULK", which was the obvious
way to make it an opt-in feature.

I guess it'd be easy to add a counter to ModifyTableState, although it makes
the code a bit less clean and conceivably performs "discontinuously": inserting
100 rows/sec for the first 999 rows and then 200 rows/sec afterwards.
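A counter-based switch might look roughly like this. This is only a minimal
sketch under assumed names (InsertState, maybe_enable_bulk, and the threshold
value are all illustrative, not the actual executor code - the real change
would track the count on ModifyTableState and call GetBulkInsertState() at the
cutover):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-statement state; the real patch would hang this
 * off ModifyTableState instead. */
typedef struct InsertState
{
    size_t ntuples;   /* rows inserted so far by this statement */
    bool   use_bulk;  /* have we switched to the bulk strategy? */
} InsertState;

#define BULK_INSERT_THRESHOLD 1000  /* placeholder cutoff, not a tuned value */

/* Called once per row: stay on the normal path for small inserts, and
 * switch to a BulkInsertState-style strategy only after the threshold,
 * so short statements pay no extra cost. */
static void
maybe_enable_bulk(InsertState *state)
{
    if (!state->use_bulk && ++state->ntuples >= BULK_INSERT_THRESHOLD)
        state->use_bulk = true;  /* real code: GetBulkInsertState() here */
}
```

The "discontinuous" behavior above is visible in the sketch: nothing changes
until exactly the threshold row, at which point the strategy flips for the rest
of the statement.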

If you "mix" small inserts and big inserts, it would be a bad strategy to
optimize for the small ones. Anyway, in a quick test, small inserts were not
slower.
https://www.postgresql.org/message-id/20200713015700.GA23581%40telsasoft.com

Do you have an example that regresses with bulk insert?

The two patches are separate, and it's possible they should be enabled
differently or independently.
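For reference, the batching side (0002) amounts to the following shape. This
is a sketch under assumed names (Batch, flush_batch, batch_insert_one, and the
batch size are invented for illustration), not the patch itself:

```c
#include <stddef.h>

#define BATCH_SIZE 1000  /* illustrative; real code would size this by memory */

typedef struct Batch
{
    int    nslots;    /* tuples currently buffered */
    size_t nflushes;  /* how many multi-insert calls were made */
} Batch;

/* Stand-in for the flush step: write all buffered tuples with one round
 * of block access instead of one per row. */
static void
flush_batch(Batch *b)
{
    if (b->nslots == 0)
        return;
    b->nflushes++;   /* real code: table_multi_insert(rel, slots, nslots, ...) */
    b->nslots = 0;
}

/* Per-row path: buffer the tuple, flush only when the batch fills.
 * The caller must flush once more at end of statement. */
static void
batch_insert_one(Batch *b)
{
    if (++b->nslots >= BATCH_SIZE)
        flush_batch(b);
}
```

This is what makes batching largely orthogonal to the ring buffer: the batch
controls how often block-level work happens, while the BulkInsertState controls
which buffers that work dirties.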

--
Justin

> Given what I've seen and tried so far with various benchmarks, I would also
> really like to see a different approach here. The "BEGIN BULK" approach can
> be problematic, for example, if you mix small and large inserts in the same
> transaction, or if your application occasionally does a bulk insert but
> otherwise runs mostly OLTP transactions.

> To me the idea from Simon sounds good: only use a bulk insert state after
> inserting e.g. 1000 rows. This also seems more applicable to most
> applications than requiring a change to any application that wants
> faster ingest.
>
> Another approach could be to combine this, for example, with a few extra
> requirements to limit the number of regressions and first learn more about
> how this behaves in the field.
> We could, for example (just throwing out some ideas), require that:
> - the relation has a certain size
> - a BufferStrategy of at most a certain size is used
> - there is a certain number of lock waiters on the relation extension lock
> (like we do with bulk extend)
> - we have extended the relation by at least e.g. 4 MB without using the FSM,
> thereby showing that we are doing bulk operations instead of random small
> extensions into the relation that go through the FSM.
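To make the gating heuristics above concrete, they could combine into a single
predicate along these lines. Purely a sketch: the struct, field names, and
every threshold here are invented placeholders, not proposed values:

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented container for the statistics the heuristics would consult. */
typedef struct RelExtendStats
{
    size_t rel_size;                /* current relation size, in bytes */
    int    extension_lock_waiters;  /* backends waiting on the extension lock */
    size_t extended_without_fsm;    /* bytes extended without touching the FSM */
} RelExtendStats;

/* Sketch: enable the bulk strategy only when the workload looks bulk-ish.
 * All thresholds are placeholders, not tuned values. */
static bool
should_use_bulk(const RelExtendStats *s)
{
    return s->rel_size >= 8u * 1024 * 1024                    /* relation has some size */
        && (s->extension_lock_waiters >= 2                    /* extension contention, or */
            || s->extended_without_fsm >= 4u * 1024 * 1024);  /* sustained bulk extend */
}
```

The appeal of such a predicate is that small OLTP inserts never satisfy it, so
the regression risk is confined to workloads that already look like bulk loads.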
>
> Another option would be to first improve the bulk operation facilities in
> general and then have another shot at this. I'm not sure if there is a
> benchmark / query that shows where such a 10x slowdown would appear, but
> maybe that would be worth a look as well.
>
> Regards,
> Luc
