Re: Bulk Inserts and WAL Inserts

From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bulk Inserts and WAL Inserts
Date: 2009-09-25 13:58:36
Message-ID: op.u0tfrys1cke6l8@soyouz
Lists: pgsql-hackers

> heap_bulk_insert would do the accumulation. The hack to XLogInsert would
> involve making it insert an extra dummy xlog record header for every
> tuple

A WAL record header is something like 40 bytes, so if you make lots of
inserts into a page, you'd be better off WAL-logging the whole page: it
takes less space and it is much faster. Besides, if you're coming right
after a checkpoint, you're going to log the full page anyway on the first
inserted row, so it would be even better to do it on the last inserted
row...

A WAL heap-insert record is 55 bytes plus the tuple data.
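To put rough numbers on the full-page argument (a sketch: the 55-byte per-record overhead is the figure above, and 8192 bytes is the default PostgreSQL block size; the helper names are mine, not backend functions):

```python
RECORD_OVERHEAD = 55      # WAL header + heap-insert info per tuple (figure above)
PAGE_SIZE = 8192          # default PostgreSQL block size

def per_tuple_wal(n_tuples, tuple_size):
    """WAL volume when every insert gets its own record."""
    return n_tuples * (RECORD_OVERHEAD + tuple_size)

def full_page_wal(tuple_size):
    """WAL volume when the filled page is logged once instead."""
    return RECORD_OVERHEAD + PAGE_SIZE

# A page filled with small 40-byte tuples (~200 of them fit in 8 kB):
n = PAGE_SIZE // 40
print(per_tuple_wal(n, 40))   # ~19 kB of WAL to describe one 8 kB page
print(full_page_wal(40))      # ~8.2 kB: logging the whole page wins
```

So for rows smaller than the per-record overhead, per-tuple logging more than doubles the WAL volume.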

However, you can't hold an exclusive lock on a buffer while going out into
the executor to fetch the next tuple, since that can take an undetermined
amount of time: imagine the record comes from a dblink() and there is a
connection loss... So you'd need a TupleBuffer, something like a
tuplestore that doesn't spill to disk and holds only about 32 kB of
tuples, and then:
- fill the buffer with tuples coming from the executor (or COPY),
- pass it to heap_bulk_insert(),
- TOAST the tuples (maybe also bulk-insert into the TOAST table),
- take the exclusive lock on the buffer,
- insert the tuples quickly,
- log the whole page when it's full, or log the individual inserts (in one
operation),
- release the lock.
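The steps above could be sketched like this (everything here is a hypothetical stand-in: `insert_page` and `log_page` are not actual backend functions, and a threading.Lock plays the part of the buffer's exclusive lock):

```python
import threading

BUFFER_LIMIT = 32 * 1024          # ~32 kB of tuples, as suggested above

buffer_lock = threading.Lock()    # stand-in for the heap buffer's exclusive lock

def bulk_insert(tuple_source, insert_page, log_page):
    """Accumulate executor output outside the lock, then insert each
    batch in one short critical section."""
    batch, batch_size = [], 0
    for tup in tuple_source:      # may block (dblink, etc.) -- no lock held here
        batch.append(tup)
        batch_size += len(tup)
        if batch_size >= BUFFER_LIMIT:
            flush(batch, insert_page, log_page)
            batch, batch_size = [], 0
    if batch:
        flush(batch, insert_page, log_page)

def flush(batch, insert_page, log_page):
    # The exclusive lock is held only for the fast, in-memory part.
    with buffer_lock:
        for tup in batch:
            insert_page(tup)      # quick in-memory insert
        log_page(batch)           # one WAL operation for the whole batch
```

The point is simply that nothing slow (executor, network, TOAST fetches) ever happens between lock and unlock.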

> in the rdata chain. Alternatively, the hack to heap replay would involve
> making it accept multiple tuples reported under a single WAL record. I
> don't know which one would be easier.

This one is probably easier, since all your WAL records refer to the same
buffer.

> I'm sure it could be made to work. I haven't checked the replay code,
> but I
> doubt it would work on this massed record right out of the box. Or we
> could
> insert dummy headers between the each tuple's WAL data.

For small rows, the header is as large as the row... might as well get rid
of it!

Inserting dummy headers would not work; here's why:

- The critical part (lock-wise) of XLogInsert is generating an LSN.
- The LSN calculation needs to verify that the header isn't split between
pages.
- Therefore, if you want to insert N records in one operation, this
critical part is going to take quite some time.
- And the lock should be held for as little time as possible...
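A minimal sketch of what each per-record reservation has to do (sizes and names are simplified assumptions here, not the actual xlog page layout):

```python
XLOG_BLCKSZ = 8192
HEADER_SIZE = 40   # approximate record-header size discussed above
PAGE_HEADER = 24   # space lost at each xlog page boundary (simplified)

def reserve(curr_lsn, record_len):
    """Advance the insert position, making sure a record header never
    straddles an xlog page boundary (record *data* may straddle)."""
    room = XLOG_BLCKSZ - (curr_lsn % XLOG_BLCKSZ)
    if room < HEADER_SIZE:
        # Not enough room for a whole header: skip to the next page.
        curr_lsn += room + PAGE_HEADER
    start = curr_lsn
    return start, start + record_len   # (record LSN, next insert position)

# Reserving N records means N of these checks inside the critical
# section, so batching N records does not shrink the lock hold time.
lsn = 0
for _ in range(3):
    start, lsn = reserve(lsn, 5000)
```

This is why dummy per-tuple headers buy nothing: the expensive part scales with the number of headers, not the number of WAL records.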

> That seems very large, even for the high throughput set up you
> describe. Is
> the WAL background writer set at the default interval? (On the other
> hand,
> if 128MB is just a rounding error in your machine's total RAM size, why
> not
> be generous? On the other other hand, I've seen perfomance start to
> drop as
> wal_buffers gets too large, though I never bothered to chased down the
> cause.)
>
> At one point, I think that XLogInsert would synchronously write out the
> buffer when it noticed it was over half full, but that got ripped out (if
> I'm going to block, I might as well wait until it actually is full to
> do).
> Now that there is a background writer, maybe it would be a good idea to
> have
> XLogInserters wake it up if the buffer is half full.
> ...

Here's what I think :

Currently, using big WAL buffers can decrease performance (while you'd
expect them to increase it). Here's why.

Case 1) consider those concurrent transactions :
- Transaction 1 : do a big operation (vacuum, COPY, etc)
- Transaction 2 : BEGIN; do a few little UPDATEs... COMMIT

Suppose you have a large wal_buffers setting like 32 MB. By the time
transaction 2 wants to commit and sync, transaction 1 has inserted lots of
WAL into the buffer, so transaction 2 finds it needs to write and fsync()
something like two 16 MB WAL segments before its own commit record is on
disk. Needless to say, transaction 2's commit is going to be pretty slow.
With small buffers, all that data would already have been in the OS cache,
if not on the disks.

Case 2) You just do a big COPY, or a big VACUUM that writes lots of xlog.
Suppose you're generating something like 50 MB/s of WAL.
You've set your WAL writer delay to something very low like 20 ms, so the
buffers are nicely emptied...
However, at this throughput you're going through one xlog segment every
~320 ms. When XLogWrite() fsync()s the segment to advance to the next one,
the data is in the OS cache, but the OS probably hasn't decided to write it
to disk yet, so the fsync call is very expensive, and it is blocking. This
means that if your buffers are smaller than a WAL segment, the buffers fill
up while you wait for the fsync, and everything stands still.
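The segment-turnover arithmetic, for reference (assuming the standard 16 MB segment size):

```python
WAL_RATE = 50 * 1024**2        # bytes/s of WAL generated, as above
SEGMENT_SIZE = 16 * 1024**2    # one xlog segment

seconds_per_segment = SEGMENT_SIZE / WAL_RATE
print(round(seconds_per_segment * 1000))   # -> 320 ms per segment switch
```

So the fsync-at-segment-switch stall happens about three times per second, and each stall can cover up to 16 MB of not-yet-flushed dirty data.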

Here's my current test setup:

- WAL on a dedicated disk (this is really good for performance unless you
have a battery-backed controller),
- wal_sync_method = open_sync (O_DSYNC),
- a very aggressive WAL writer delay like 5 ms.

Basically, all the WAL disk does is write WAL, so the head hardly moves at
all. The disk takes about 8 ms per rotation (less than wal_writer_delay).
Whenever there is WAL to write, the walwriter writes it, and blocks because
of the O_DSYNC mode. So, basically, at each disk rotation the pending WAL
is written, and fsync() is then a no-op. This gives excellent, very smooth
performance, whereas plain fsync() gives quite random, and sometimes pretty
high, wait times, since the length of an fsync() wait depends on how much
unflushed data sits in the OS buffers.

However, there is a problem with O_DSYNC:

- Latency is excellent, but throughput is miserable, because you can only
write one chunk per disk rotation.
- If all you need to write is a few pages, and you want them synced, it's
perfect.
- If you want to write at a high throughput, though, you'd better write
data in LARGE chunks, ideally one WAL segment at a time.

So, O_DSYNC is much better with large WAL buffers (>16 MB) and a small
walwriter delay, ideally delay < disk rotation time.

> Maybe there could be some hybrid approach. You take the spinlock, check
> that you could get the lwlock if you wanted to. If you could get the
> lwlock
> and know you have almost no work to do, then just do it while holding the
> spinlock. If you discover you have more work todo (xlog file switch,
> nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop
> the
> spinlock. This *really* breaks the encapsulation of lwlock, so it is
> probably not much more than idle speculation.

That's a description of something that could look like a futex ;)

Anyway, I've made a little patch (attached).
It applies to git revision 1384847ef8731718a79a32cd354e31c31c5294a0 of
current postgres (from last week), it probably applies to current HEAD,
but I have to learn how to do a merge in git before I can tell you that ;)

What it does :

I've put extremely detailed comments in xlog.c, this is a brief summary :

Since the critical part of XLogInsert is generating an LSN, not writing the
data, I have created an LSN generator, which is extremely fast, so it can
be protected by a spinlock. Using anything other than a spinlock there
creates a nasty point of serialization and kills all the performance gains.

Concurrent XLogInserts get a LSN from the LSN generator, and then they
insert their data in the buffers, concurrently, under a LW_SHARED
WALInsertLock.

The buffer queue logic was entirely redone too.

XLogWrite marks buffers as free as soon as possible, so if you use 32 MB of
wal_buffers, when XLogWrite writes data the buffer space is reused
immediately, while the previous data is being fsync()ed.

To avoid writing partial WAL records to disk, we must be sure that all
records in the buffer have been completely copied in. This is done by
taking an exclusive lock on WALInsertLock (which waits for all in-progress
memory writes to finish), taking a peek at the last written record, and
releasing the lock immediately.

I've also added an adaptive walwriter delay (depending on load).

I've added a lot of comments in xlog.c, check them out.

Here is the configuration I use for testing :

fsync = on
synchronous_commit = on or off depending on test
wal_sync_method = open_sync
full_page_writes = on
wal_buffers = 32MB
wal_writer_delay = 100ms (adaptive walwriter delay lowers it if needed)

Benchmarks are parallel COPY and parallel INSERT INTO ... SELECT. However,
be careful with parallel INSERT INTO ... SELECT if all your threads insert
into the same table: you'll be benchmarking the FSM more than the WAL
itself...

On my setup, stock 8.5 is bound by the WALInsertLock, but with this patch
my disks are often maxed out, so I can't really tell; I had to put the
tables on a ramdisk.
I get an almost 2x speedup on a parallel COPY of a table with 9 INT fields;
beyond that, the CPU is maxed out parsing integers...
If you have more than 4 cores and faster disks, I'd be very interested in
your results!

The patch will print some stuff to the console; if you see lots of
">>W>W>>>W", it means your WAL writing throughput is maxed out.
"W" means a backend had to flush buffers inside XLogInsert (bad) instead of
somewhere non-critical like the walwriter.
"<" and ">" mark the start and end of an exclusive lock on WALInsertLock,
taken because it had to flush.

Attached is a little Python benchmark script. Change the definitions for
where to put the files if you want to use it.

Regards,

Pierre

Attachment Content-Type Size
parallel_test_v5.py text/x-python 8.0 KB
xlog_perf_hack_v1.patch.gz application/x-gzip 29.7 KB
