Re: should INSERT SELECT use a BulkInsertState?

From: Luc Vlaming <luc(at)swarm64(dot)com>
To: Justin Pryzby <pryzby(at)telsasoft(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: should INSERT SELECT use a BulkInsertState?
Date: 2020-11-02 06:53:45
Message-ID: bd0dc980-c4c9-29c9-8cb6-8d7787ae9fef@swarm64.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30.10.20 05:51, Justin Pryzby wrote:
> On Thu, Oct 22, 2020 at 01:29:53PM +0100, Simon Riggs wrote:
>> On Fri, 16 Oct 2020 at 22:05, Justin Pryzby <pryzby(at)telsasoft(dot)com> wrote:
>>
>>>>> I made this conditional on BEGIN BULK/SET bulk, so I'll solicit comments on that.
>>
>> I think it would be better if this was self-tuning. So that we don't
>> allocate a bulkinsert state until we've done say 100 (?) rows
>> inserted.
>
> I made it an optional, non-default behavior in response to the legitimate
> concern for performance regression for the cases where a loader needs to be as
> fast as possible - as compared with our case, where we want instead to optimize
> for our reports by making the loaders responsible for their own writes, rather
> than leaving behind many dirty pages, and clobbering the cache, too.
>
> Also, INSERT SELECT doesn't immediately help us (telsasoft), since we use
> INSERT .. VALUES () .. ON CONFLICT. This would handle that case, which is
> great, even though that wasn't a design goal. It could also be an integer GUC
> to allow configuring the size of the ring buffer.
>
>> You should also use table_multi_insert() since that will give further
>> performance gains by reducing block access overheads. Switching from
>> single row to multi-row should also only happen once we've loaded a
>> few rows, so we don't introduce overahads for smaller SQL statements.
>
> Good idea...multi_insert (which reduces the overhead of individual inserts) is
> mostly independent from BulkInsert state (which uses a ring-buffer to avoid
> dirtying the cache). I made this 0002.
>
> This makes INSERT SELECT several times faster, and not clobber the cache too.
>
> Time: 4700.606 ms (00:04.701)
> 123 | 1
> 37 | 2
> 20 | 3
> 11 | 4
> 4537 | 5
> 11656 |
>
> Time: 1125.302 ms (00:01.125)
> 2171 | 1
> 37 | 2
> 20 | 3
> 11 | 4
> 111 | 5
> 14034 |
>
> When enabled, this passes nearly all regression tests, and all but 2 of the
> changes are easily understood. The 2nd patch still needs work.
>

Hi,

Came across this thread because I'm working on an improvement for the
relation extension to improve the speed of the bulkinsert itself in
(highly) parallel cases and would like to make sure that our approaches
work nicely together.

Given what I've seen and tried so far with various benchmarks I would
also really like to see a different approach here. The "BEGIN BULK" can
be problematic for example if you mix small amounts of inserts and big
amounts in the same transaction, or if your application possibly does a
bulk insert but otherwise mostly OLTP transactions.

To me the idea from Simon sounds good to only use a bulk insert state
after inserting e.g. a 1000 rows, and this also seems more applicable to
most applications compared to requiring a change to any application that
wishes to have faster ingest.

Another approach could be to combine this, for example, with a few extra
requirements to limit the amount of regressions and first learn more how
this behaves in the field.
We could, for example, only (just throwing out some ideas), require that:
- the relation has a certain size
- a BufferStrategy a maximum certain size is used
- there is a certain amount of lock waiters on relation extension. (like
we do with bulk extend)
- we have extended the relation for at least e.g. 4 MB and not used the
FSM anymore thereby proving that we are doing bulk operations instead of
random small extensions everywhere into the relation that use the FSM.

Another thing is that we first try to improve the bulk operation
facilities in general and then have another shot at this? Not sure if
there is some benchmark / query that shows where such a 10x slowdown
would appear but maybe that would be worth a look as well possibly.

Regards,
Luc

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Julien Rouhaud 2020-11-02 06:59:10 Re: Collation versioning
Previous Message Greg Nancarrow 2020-11-02 06:39:00 Re: Parallel INSERT (INTO ... SELECT ...)