Re: faster ETL / bulk data load for heap tables

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Luc Vlaming <luc(at)swarm64(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: faster ETL / bulk data load for heap tables
Date: 2021-01-01 18:55:13
Message-ID: CALNJ-vQqUw5LOjEtyV80-V6knSy7j4ovjfGf0nBVSBvH6Ym=1A@mail.gmail.com
Lists: pgsql-hackers

Hi, Luc:
Happy New Year.

Looking at BufferAllocExtend()
in v1-0002-WIP-buffer-alloc-specialized-for-relation-extensi.patch, it
seems there is duplicate code with the existing BufferAlloc().

It would be good to do some refactoring here by extracting the common code
into a helper function.
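
Something along these lines, just to sketch the shape I mean (a minimal
standalone sketch with stand-in types and hypothetical names such as
buffer_alloc_common(), not the actual bufmgr.c signatures): the shared
lookup/eviction logic lives in one helper, and the extension path simply
passes in its cached partition lock.

/* Sketch only: stand-in types and names, not the real bufmgr.c API. */
#include <stddef.h>

typedef struct CachedPartition
{
    int partition;              /* mapping partition currently held, or -1 */
} CachedPartition;

/* Common lookup/eviction logic shared by both entry points. */
static int
buffer_alloc_common(unsigned block, CachedPartition *cached)
{
    int partition = (block / 128) % 128;    /* stand-in partition mapping */

    if (cached == NULL || cached->partition != partition)
    {
        /* release any previously held partition lock, take the new one,
         * and remember it when the caller wants lock caching */
        if (cached)
            cached->partition = partition;
    }
    /* ... shared hash-table lookup and victim-buffer selection ... */
    return partition;
}

/* existing path: no partition-lock caching */
static int
buffer_alloc(unsigned block)
{
    return buffer_alloc_common(block, NULL);
}

/* extension path: keeps the last partition lock across calls */
static int
buffer_alloc_extend(unsigned block, CachedPartition *cached)
{
    return buffer_alloc_common(block, cached);
}

int
main(void)
{
    CachedPartition cached = { -1 };

    buffer_alloc(7);                 /* plain allocation */
    buffer_alloc_extend(7, &cached); /* extension: lock is now cached */
    buffer_alloc_extend(8, &cached); /* same partition: no lock switch */
    return 0;
}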

Thanks

On Fri, Jan 1, 2021 at 6:07 AM Luc Vlaming <luc(at)swarm64(dot)com> wrote:

> Hi,
>
> In an effort to speed up bulk data loading/transforming I noticed that
> considerable time is spent waiting on the relation extension lock. I know
> there are already many other efforts to increase the chances of using bulk
> loading [1], [2], [3], [4], to make loading more parallel [5], and to
> remove some syscalls [6], as well as completely new I/O systems [7] and
> some more extreme measures like disabling WAL logging [8].
>
> Whilst they will all help, they will ultimately be limited by the
> relation extension lock. Also, the tests I've done so far suggest that,
> at least for bulk loading, pwrite() can actually carry us rather far as
> long as we are not doing it under a global lock. Moreover, the solution
> provided here might be an alternative to [7], because the results are
> quite promising even with WAL enabled.
>
> Attached are two WIP patches in the hope of getting feedback.
> The first changes the way we do bulk loading: each backend now gets a
> standalone set of blocks allocated that are local to that backend. This
> helps both with reducing contention and with some OS writeback
> mechanisms.
> The second patch reduces the time spent locking the buffer partitions
> by shifting the logic around so that each set of 128 blocks uses the same
> buffer partition, and by adding a custom function to get buffer blocks
> specifically for extension whilst keeping the previous partition lock,
> thereby reducing the amount of time we spend on futexes.
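>
> To illustrate the effect of the 128-block grouping outside the patch (a
> simplified standalone demo, not the buffer-mapping code itself; the
> multiplicative hash and partition count below are stand-ins), one backend
> extending consecutive blocks only has to switch partition locks once every
> 128 blocks instead of on nearly every block:
>
> /* gcc -O2 -o partdemo partdemo.c && ./partdemo */
> #include <stdio.h>
> #include <stdint.h>
>
> #define NUM_PARTITIONS 128   /* stand-in for the buffer mapping partitions */
> #define BULK_GROUP     128   /* consecutive blocks that share a partition  */
>
> /* stand-in hash: scatters consecutive blocks across partitions */
> static int part_scattered(uint32_t blk) { return (blk * 2654435761u) % NUM_PARTITIONS; }
>
> /* grouped mapping: 128 consecutive blocks map to the same partition */
> static int part_grouped(uint32_t blk)   { return (blk / BULK_GROUP) % NUM_PARTITIONS; }
>
> int main(void)
> {
>     int switches_scattered = 0, switches_grouped = 0;
>     int last_s = -1, last_g = -1;
>
>     /* simulate one backend extending 4096 consecutive blocks */
>     for (uint32_t blk = 0; blk < 4096; blk++)
>     {
>         int p = part_scattered(blk);
>         if (p != last_s) { switches_scattered++; last_s = p; }
>
>         p = part_grouped(blk);
>         if (p != last_g) { switches_grouped++; last_g = p; }
>     }
>
>     printf("lock switches, scattered mapping: %d\n", switches_scattered);
>     printf("lock switches, grouped by 128:    %d\n", switches_grouped);
>     return 0;
> }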
>
> The design:
> - add a set of buffers to the BulkInsertState that can be used by the
> backend without any locks.
> - add a ReadBufferExtendBulk function which extends the relation by
> BULK_INSERT_BATCH_SIZE blocks at once.
> - rework FileWrite to take a parameter indicating that the write is only
> there to extend the file, to speed up relation extension. If supported,
> it then uses ftruncate, as this is much faster (also than posix_fallocate
> on my system) and, according to the manpages
> (https://linux.die.net/man/2/ftruncate), the extended region reads back
> as zeroed space. To be cleaned up later, possibly into a dedicated
> function FileExtend().
> - rework mdextend to take a page count.
> - make a specialized version of BufferAlloc called BufferAllocExtend
> which keeps the lock on the last buffer partition around and tries to
> reuse it, so that there are far fewer futex calls.
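>
> As a quick standalone sanity check of the ftruncate behaviour relied on
> above (this is not the FileWrite change itself; the temp-file path, block
> size and block count are arbitrary), extending a file with ftruncate
> indeed makes the new region read back as zeroes:
>
> /* gcc -o extdemo extdemo.c && ./extdemo */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> #define BLCKSZ  8192
> #define NBLOCKS 128             /* extend by 128 blocks in one call */
>
> int main(void)
> {
>     char  tmpl[] = "/tmp/extdemo-XXXXXX";
>     int   fd = mkstemp(tmpl);
>     char  buf[BLCKSZ];
>     off_t oldsize, newsize;
>
>     if (fd < 0) { perror("mkstemp"); return 1; }
>
>     /* pretend the relation already has one block of data */
>     memset(buf, 'x', sizeof(buf));
>     if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }
>
>     oldsize = lseek(fd, 0, SEEK_END);
>     newsize = oldsize + (off_t) NBLOCKS * BLCKSZ;
>
>     /* one ftruncate instead of writing NBLOCKS zero-filled pages */
>     if (ftruncate(fd, newsize) != 0) { perror("ftruncate"); return 1; }
>
>     /* the extended region must read back as zeroes */
>     if (pread(fd, buf, sizeof(buf), oldsize) != (ssize_t) sizeof(buf)) { perror("pread"); return 1; }
>     for (int i = 0; i < BLCKSZ; i++)
>         if (buf[i] != 0) { fprintf(stderr, "non-zero byte found\n"); return 1; }
>
>     printf("extended from %lld to %lld bytes; new blocks read as zeroes\n",
>            (long long) oldsize, (long long) newsize);
>     unlink(tmpl);
>     return 0;
> }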
>
> Many things are still to be done; some are:
> - reintroduce the FSM, and possibly optimize the lock usage there. In
> other words: this patch currently can only start the server and run COPY
> FROM and read queries.
> - look into the WAL logging. Whilst it seems to be fairly optimal already
> wrt the locking and such, I noticed there seems to be an alternating
> pattern between the bgwriter and the workers, although setting some
> parameters bigger (wal_writer_flush_after, wal_writer_delay, wal_buffers)
> helped a lot.
> - work nicely with the code from [6] so that register_dirty_segment is
> indeed not needed anymore, or alternatively optimize that code so that
> fewer locks are needed.
> - make use of [9] in the fallback case in FileWrite() when
> ftruncate/fallocate is not available, so that the buffer size can be
> reduced.
>
> First results are below; all tests loaded the same 1 GB lineitem CSV 32
> times into the same table. Tests were done on an NVMe disk, and the more
> parallel ones also on a tmpfs disk, to spot potential disk bottlenecks
> and to get an idea of the potential wrt using NVDIMM.
> =================================
> using an unlogged table:
> NVME, UNLOGGED table, 4 parallel streams: HEAD 171s, patched 113s
> NVME, UNLOGGED table, 8 parallel streams: HEAD 113s, patched 67s
> NVME, UNLOGGED table, 16 parallel streams: HEAD 112s, patched 42s
> NVME, UNLOGGED table, 32 parallel streams: HEAD 121s, patched 36s
> tmpfs, UNLOGGED table, 16 parallel streams: HEAD 96s, patched 38s
> tmpfs, UNLOGGED table, 32 parallel streams: HEAD 104s, patched 25s
> =================================
> using a normal table, wal_level=minimal, 16MB wal segments:
> NVME, 4 parallel streams: HEAD 237s, patched 157s
> NVME, 8 parallel streams: HEAD 200s, patched 142s
> NVME, 16 parallel streams: HEAD 171s, patched 145s
> NVME, 32 parallel streams: HEAD 199s, patched 146s
> tmpfs, 16 parallel streams: HEAD 131s, patched 89s
> tmpfs, 32 parallel streams: HEAD 148s, patched 98s
> =================================
> using a normal table, wal_level=minimal, 256MB wal segments,
> wal_init_zero = off, wal_buffers = 262143, wal_writer_delay = 10000ms,
> wal_writer_flush_after = 512MB
>
> NVME, 4 parallel streams: HEAD 201s, patched 159s
> NVME, 8 parallel streams: HEAD 157s, patched 109s
> NVME, 16 parallel streams: HEAD 150s, patched 78s
> NVME, 32 parallel streams: HEAD 158s, patched 70s
> tmpfs, 16 parallel streams: HEAD 106s, patched 54s
> tmpfs, 32 parallel streams: HEAD 113s, patched 44s
> =================================
>
> Thoughts?
>
> Cheers,
> Luc
> Swarm64
>
>
> [1] https://www.postgresql.org/message-id/flat/CAJcOf-f%3DUX1uKbPjDXf%2B8gJOoEPz9VCzh7pKnknfU4sG4LXj0A%40mail.gmail.com#49fe9f2ffcc9916cc5ed3a712aa5f28f
> [2] https://www.postgresql.org/message-id/flat/CALj2ACWjymmyTvvhU20Er-LPLaBjzBQxMJwr4nzO7XWmOkxhsg%40mail.gmail.com#be34b5b7861876fc0fd7edb621c067fa
> [3] https://www.postgresql.org/message-id/flat/CALj2ACXg-4hNKJC6nFnepRHYT4t5jJVstYvri%2BtKQHy7ydcr8A%40mail.gmail.com
> [4] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40mail.gmail.com
> [5] https://www.postgresql.org/message-id/flat/CALDaNm3GaZyYPpGu-PpF0SEkJg-eaW3TboHxpxJ-2criv2j_eA%40mail.gmail.com#07292ce654ef58fae7f257a4e36afc41
> [6] https://www.postgresql.org/message-id/flat/20200203132319.x7my43whtefeznz7%40alap3.anarazel.de#85a2a0ab915cdf079862d70505abe3db
> [7] https://www.postgresql.org/message-id/flat/20201208040227.7rlzpfpdxoau4pvd%40alap3.anarazel.de#b8ea4a3b7f37e88ddfe121c4b3075e7b
> [8] https://www.postgresql.org/message-id/flat/CAD21AoA9oK1VOoUuKW-jEO%3DY2nt5kCQKKFgeQwwRUMoh6BE-ow%40mail.gmail.com#0475248a5ff7aed735be41fd4034ae36
> [9] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHf_R_ih1pkBMTWn%3DSTyKMOM2Ks47Y_UqqfU1wRc1VvA%40mail.gmail.com#7a53ad72331e423ba3c6a50e6dc1259f
>
