Re: Bulk Inserts

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-18 04:52:21
Message-ID: f67928030909172152m9e54850vba8e38a61e96f93f@mail.gmail.com
Lists: pgsql-hackers

2009/9/16 Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>

>
> OK, that makes sense. I thought you had hacked either XLogInsert or the
>> heap WAL replay code so that you could just accumulate tuples in the rdata
>> chain and then submit them all under the cover of a single WALInsertLock.
>>
>
> Ah, no, I did not do that.
>
> This would be difficult to do: the rdata chain contains buffer pointers, and
> when we are in XLogInsert, we have an exclusive lock on the buffer. If
> XLogInsert accumulated xlog records as you say, then later, when it's time
> to write them to the xlog, it would no longer hold the exclusive lock on the
> buffer, so its contents could have changed, and if XLogInsert decided to do
> a full page write, the contents would be wrong.
>
>

Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata.
(Actually, I have done that, purely as a proof of concept. The resulting
database is completely unrecoverable, but as long as you don't bring it down,
it runs fine and lets me test the speed of different concepts without going
to the trouble of implementing them correctly.)

heap_bulk_insert would do the accumulation. The hack to XLogInsert would
involve making it insert an extra dummy xlog record header for every tuple
in the rdata chain. Alternatively, the hack to heap replay would involve
making it accept multiple tuples reported under a single WAL record. I
don't know which one would be easier.
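As an illustration only, the accumulated chain with a dummy record header per tuple might be modeled like this. The structures and names here are simplified stand-ins, not PostgreSQL's actual XLogRecData (which also carries buffer references and CRC bookkeeping):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified model of an rdata chain: each element points
 * at a piece of WAL payload. A real XLogRecData also carries buffer
 * references, which this sketch omits. */
typedef struct RData {
    const void   *data;
    size_t        len;
    int           is_record_header;   /* dummy per-tuple header marker */
    struct RData *next;
} RData;

static RData *chain_append(RData *tail, const void *data, size_t len,
                           int is_header)
{
    RData *node = malloc(sizeof(RData));
    node->data = data;
    node->len = len;
    node->is_record_header = is_header;
    node->next = NULL;
    if (tail)
        tail->next = node;
    return node;
}

/* Build one chain for ntuples, inserting a dummy record header before
 * each tuple so that replay could split the massed record back apart. */
static RData *build_bulk_chain(const char **tuples, int ntuples)
{
    static const char hdr[] = "HDR";  /* stand-in for a record header */
    RData *head = NULL, *tail = NULL;
    for (int i = 0; i < ntuples; i++) {
        tail = chain_append(tail, hdr, sizeof(hdr), 1);
        if (head == NULL)
            head = tail;
        tail = chain_append(tail, tuples[i], strlen(tuples[i]), 0);
    }
    return head;
}
```

The alternative (teaching heap replay to accept multiple tuples under one record) would instead put a tuple count in a single header and loop in the redo routine.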

>
> Besides, the LSN needs to be set in the page at every call to heap_insert
> (or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff before
> writing it, it would need a mechanism to assign an LSN without having written
> the xlog data yet.
>

Right, XLogInsert would only be able to accumulate as long as it knew it was
going to get called again before the buffer exclusive lock was released.
That is why the accumulation is better done in the heap_bulk_insert,
otherwise it would require an unfortunate amount of communication between
the two.
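One conceivable mechanism for handing out an LSN before the data is physically written is to atomically reserve byte positions in the future WAL stream and stamp the page with the reserved end position. This toy sketch uses hypothetical names and elides the locking (the reservation would need a spinlock or fetch-and-add); it is only meant to show the shape of the idea:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Byte position reserved so far in the (not yet written) WAL stream.
 * In real code this would be protected by a spinlock or updated with
 * an atomic fetch-and-add; elided here for clarity. */
static XLogRecPtr reserved_upto = 0;

/* Reserve reclen bytes of WAL space and return the end position, which
 * is what would be stamped into the page as its LSN. The actual copy
 * of the record into the WAL buffers can then happen later. */
static XLogRecPtr reserve_xlog_space(uint64_t reclen)
{
    XLogRecPtr start = reserved_upto;
    reserved_upto += reclen;
    return start + reclen;
}
```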

> If you haven't done that, then of course doing the bulk insert doesn't
>> help much if you still do tuple-by-tuple XLogInsert.
>>
>
> Exactly.
>
> So in the case that it is
>> under the limit, you first run through the tuples putting them into the
>> block, then run through the tuples again doing the XLogInserts?
>>
>
> Yes, exactly. This isn't really optimal...
> I wonder if I could build one rdata chain containing all updates to the
> tuples and submit it in one go. Would it work?...

I'm sure it could be made to work. I haven't checked the replay code, but I
doubt it would work on this massed record right out of the box. Or we could
insert dummy headers between each tuple's WAL data.

...

> So in order to benchmark the right thing, I have:
> - all the tables in a big RAMDISK
> - xlog on RAID5
> - fsync=fdatasync
>
> I've also found that setting wal_buffers to a large value like 128MB gives
> a significant speed boost when doing COPY or INSERT INTO SELECT, probably
> because it allows the backends to always find space in the buffers even if
> the walwriter is a bit busy.
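For reference, the setup described above would correspond to something like the following postgresql.conf fragment (assuming "fsync=fdatasync" refers to the wal_sync_method GUC; the ramdisk/RAID5 placement is done at the filesystem level via tablespaces and the pg_xlog mount, not in the config file):

```ini
# WAL tuning used for the benchmark described above
wal_sync_method = fdatasync
wal_buffers = 128MB        # unusually large; default-era values were in KB
```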

That seems very large, even for the high-throughput setup you describe. Is
the WAL background writer set at the default interval? (On the other hand,
if 128MB is just a rounding error in your machine's total RAM size, why not
be generous? On the other other hand, I've seen performance start to drop as
wal_buffers gets too large, though I never bothered to chase down the
cause.)

At one point, I think that XLogInsert would synchronously write out the
buffer when it noticed it was over half full, but that got ripped out (if
I'm going to block, I might as well wait until it actually is full to do so).
Now that there is a background writer, maybe it would be a good idea to have
XLogInserters wake it up if the buffer is half full.
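A sketch of that half-full heuristic, purely illustrative: after copying its record, an inserter checks buffer occupancy and signals the background walwriter once past 50%, instead of writing synchronously itself. The names (WalBuf, wake_walwriter) are made up for the example, not PostgreSQL's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the WAL buffer; "signalled" stands in for waking the
 * walwriter via a latch or signal. */
typedef struct WalBuf {
    size_t size;                 /* total buffer size in bytes */
    size_t used;                 /* bytes currently unflushed */
    bool   walwriter_signalled;  /* have we already woken the writer? */
} WalBuf;

static void wake_walwriter(WalBuf *buf)
{
    buf->walwriter_signalled = true;  /* stand-in for SetLatch/kill() */
}

/* Insert a record of reclen bytes; wake the background writer once the
 * buffer crosses half full, so it can drain before inserters block. */
static void wal_insert(WalBuf *buf, size_t reclen)
{
    buf->used += reclen;              /* record copy itself elided */
    if (buf->used * 2 > buf->size && !buf->walwriter_signalled)
        wake_walwriter(buf);
}
```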
...

> Another angle of attack would be to make wal-writing more efficient...
>>>
>> If you mean to do this without changing the xlog interfaces, I'm not
>> optimistic.
>>
>
> Agreed: that's why I didn't even try ;)
>
> If you have to call XLogInsert once per row that is copied (or
>> insert...select), then my experiments show that simply taking the
>> WALInsertLock and immediately releasing it, doing absolutely no real work
>> while it is held, is already a substantial multi-core scalability
>> bottleneck.
>>
>>
>
> Confirmed by context switch issue above...
>
> Having all cores block on the same lock for every row can be OK if it's a
> spinlock protecting just a few lines of code... which is not the present
> case...

Maybe there could be some hybrid approach. You take the spinlock, then check
that you could get the lwlock if you wanted to. If you could get the lwlock
and know you have almost no work to do, then just do it while holding the
spinlock. If you discover you have more work to do (xlog file switch,
nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop the
spinlock. This *really* breaks the encapsulation of lwlock, so it is
probably not much more than idle speculation.
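As a single-threaded toy model only, the control flow of that hybrid would look something like the following. All names are invented for the example; real LWLocks deliberately don't expose their internal state this way, which is exactly the encapsulation break in question:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: peek at the lwlock's state while holding the spinlock,
 * take the fast path for trivial work, fall back to the lwlock for
 * anything heavy (segment switch, AdvanceXLInsert, ...). */
typedef struct {
    bool spin_held;
    bool lw_held;
} HybridLock;

typedef enum { FAST_PATH, SLOW_PATH } Path;

static Path hybrid_insert(HybridLock *lk, bool work_is_trivial)
{
    lk->spin_held = true;             /* SpinLockAcquire */
    if (!lk->lw_held && work_is_trivial) {
        /* trivial case: append the record under the spinlock alone */
        lk->spin_held = false;        /* SpinLockRelease */
        return FAST_PATH;
    }
    /* nontrivial case: escalate to the lwlock, drop the spinlock */
    lk->lw_held = true;               /* LWLockAcquire (simplified) */
    lk->spin_held = false;
    /* ... do the heavy work here ... */
    lk->lw_held = false;              /* LWLockRelease */
    return SLOW_PATH;
}
```

The hard part, not modeled here, is that "could get the lwlock if you wanted to" must be checked without racing against waiters queued on the lwlock itself.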

Jeff
