Re: Bulk Inserts

From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts
Date: 2009-09-16 10:15:34
Message-ID: op.u0chf8qwcke6l8@soyouz


> OK, that makes sense. I thought you had hacked either XLogInsert or the
> heap WAL replay code so that you could just accumulate tuples in the rdata
> chain and then submit them all under the cover of a single WALInsertLock.

Ah, no, I did not do that.

This would be difficult to do: the rdata chain contains buffer pointers, and
while we are in XLogInsert we hold an exclusive lock on the buffer. If
XLogInsert accumulated xlog records as you describe, then later, when it came
time to write them to the xlog, it would no longer hold the exclusive lock on
the buffer, so the buffer contents could have changed; and if XLogInsert then
decided to do a full-page write, the contents would be wrong.
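
For reference, each link in the chain looks like this (the 8.4-era struct
from access/xlog.h; the buffer field is what ties a piece of the record to a
page that may need a full-page image):

    typedef struct XLogRecData
    {
        char       *data;        /* start of rmgr data to include */
        uint32      len;         /* length of rmgr data to include */
        Buffer      buffer;      /* buffer associated with data, if any */
        bool        buffer_std;  /* buffer has standard pd_lower/pd_upper */
        struct XLogRecData *next;   /* next struct in chain, or NULL */
    } XLogRecData;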

Besides, the LSN needs to be set in the page on every call to heap_insert
(or else the WAL will not be "ahead"), so if XLogInsert accumulated records
before writing them, it would need a mechanism to assign an LSN without
having written the xlog data yet.
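
To illustrate, the tail of heap_insert goes roughly like this (paraphrased
from the 8.4-era heapam.c, with the rdata-building elided):

    /* XLOG stuff -- the buffer's exclusive lock is still held here */
    if (!relation->rd_istemp)
    {
        XLogRecPtr  recptr;

        /* ... build the rdata chain describing the new tuple ... */
        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INSERT, rdata);

        /* the page LSN must advance before the lock is released */
        PageSetLSN(page, recptr);
        PageSetTLI(page, ThisTimeLineID);
    }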

> If you haven't done that, then of course doing the bulk insert doesn't
> help much if you still do tuple-by-tuple XLogInsert.

Exactly.

> So in the case that it is
> under the limit, you first run through the tuples putting them into the
> block, then run through the tuples again doing the XLogInserts?

Yes, exactly. This isn't really optimal...
I wonder if I could build one rdata chain containing the updates for all the
tuples and submit it in one go. Would it work?...
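
Something like this hypothetical sketch (the record type, the xlrec header
and SizeOfHeapMultiInsert are all invented, not actual PostgreSQL code): one
XLogInsert call covering every tuple placed on the page:

    /* one header entry, then one chain entry per tuple, all pointing
     * at the same buffer so at most one full-page image is taken */
    XLogRecData rdata[MAX_TUPLES + 1];
    int         i;

    rdata[0].data = (char *) &xlrec;          /* hypothetical header */
    rdata[0].len = SizeOfHeapMultiInsert;     /* hypothetical */
    rdata[0].buffer = InvalidBuffer;
    rdata[0].next = &rdata[1];

    for (i = 0; i < ntuples; i++)
    {
        rdata[i + 1].data = (char *) heaptuples[i]->t_data;
        rdata[i + 1].len = heaptuples[i]->t_len;
        rdata[i + 1].buffer = buffer;         /* the same page each time */
        rdata[i + 1].buffer_std = true;
        rdata[i + 1].next = (i == ntuples - 1) ? NULL : &rdata[i + 2];
    }

    /* XLOG_HEAP_MULTI_INSERT does not exist; it would be a new record
     * type whose replay re-inserts all the tuples */
    recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_MULTI_INSERT, rdata);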

> Do you have an IO bottleneck even in the absence of fsyncs?

Sometimes, yes:

- When xlog is on a (rather slow) software RAID1, I've found that the fsyncs
that happen when switching to a new xlog segment are significant, and the
amount of log data written is dangerously close to the maximum throughput,
too.
- When xlog is on a much faster RAID5, there is no such problem. fsync on
RAID5 is horrendously slow if you have many pending small writes, but if all
you need to sync is a 16MB file that aligns nicely on the RAID stripes, it's
not a problem.

So, in order to benchmark the right thing, I have:
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB gives a
significant speed boost when doing COPY or INSERT INTO ... SELECT, probably
because it allows the backends to always find space in the buffers even if
the walwriter is a bit busy.
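
For the record, the non-default settings for these runs would look something
like this in postgresql.conf (assuming "fsync=fdatasync" above is shorthand
for wal_sync_method, since fsync itself is a boolean):

    wal_sync_method = fdatasync
    wal_buffers = 128MB       # far above the tiny default; needs a restart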

> My experience
> on multi-core machines with decent IO systems has been that the amount of
> WAL traffic (by volume) matters rather little, as opposed to the number of
> WALInsertLocks taken, which matters quite a bit. Of course this depends
> quite a bit on your OS and hardware.

Exactly: with the configuration above, iowait is extremely small (<1%), yet
all processes still spend 50% of their time waiting on the WALInsertLock.

The lock is held for about 15% of the total time; with 4 cores and 4
threads, if a core spends X time holding the lock, it's logical that the
other three spend a combined 3*X time waiting on it.

But consider this:

- 4 processes spend 50% of their time waiting on the lock
- therefore, at any instant, on average 2 processes are waiting
- therefore, every time the lock is released, a process is woken up

=> more than 200,000 context switches per second seen in vmstat

That works out to one context switch for every two inserted rows, which is
enormous...
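
Putting rough numbers on it (assuming all of the vmstat figure comes from
this lock):

    200,000 switches/s  x  2 rows/switch  =  ~400,000 rows/s inserted,

i.e. the lock is handed through the kernel scheduler on almost every other
insert.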

I've done a bit of profiling and found that 70% of the time spent holding
this lock goes into... computing the record header's CRC.
Just for a quick check, I removed all CRC calculations. There was absolutely
no performance gain.
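
For reference, if I recall the 8.4-era xlog.c correctly, the CRC of the rdata
chain is computed before the lock is taken, and only the header bytes are
added under it, roughly like this:

    /* finish the record's CRC now that xl_prev etc. are known;
     * rdata_crc already covers the tuple data */
    COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32),
               SizeOfXLogRecord - sizeof(pg_crc32));
    FIN_CRC32(rdata_crc);
    record->xl_crc = rdata_crc;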

>> Another angle of attack would be to make wal-writing more efficient...
> If you mean to do this without changing the xlog interfaces, I'm not
> optimistic.

Agreed: that's why I didn't even try ;)

> If you have to call XLogInsert once per row that is copied (or
> insert...select), then my experiments show that simply taking the
> WALInsertLock and immediately releasing it, doing absolutely no real work
> while it is held, is already a substantial multi-core scalability
> bottleneck.

Confirmed by the context-switch issue above...

Having all cores block on the same lock for every row can be OK if it's a
spinlock protecting just a few lines of code... which is not the case
here...

> Once we accept that this must be done, the next existing
> bottleneck is the memcpy of the first byte from the rdata chain into the
> shared wal_buffers, presumably because this copy involves fighting the
> cache
> line away from other cores. Once you've copied the first byte, the rest
> of
> them seem to be almost free. (Again, this is probably hardware and
> situation dependent).

Since I have a single 4-core CPU, I probably don't see this effect (a
multi-CPU machine would).

> I've seen some suggestions that the wal_buffer block initialization work be
> moved
> from being done by AdvanceXLInsert to instead be done by XLogWrite.
> However, I've not seen any indication that AdvanceXLInsert is a
> meaningful
> bottleneck in the first place. Except when wal_buffers is too small: then
> AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting
> called from within it, in which case moving work from one to the other is
> probably not going to make things better.

That is consistent with the benefit I saw from increasing wal_buffers to a
very large value...

I'm going to try other things and report back.
