Re: Relation bulk write facility

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation bulk write facility
Date: 2023-11-24 23:19:16
Message-ID: 93e91b02-e07f-4205-8de6-3942b91760cb@iki.fi
Lists: pgsql-hackers

On 19/11/2023 02:04, Andres Freund wrote:
> On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
>> The new facility makes it easier to optimize bulk loading, as the
>> logic for buffering, WAL-logging, and syncing the relation only needs
>> to be implemented once. It's also less error-prone: We have had a
>> number of bugs in how a relation is fsync'd - or not - at the end of a
>> bulk loading operation. By centralizing that logic to one place, we
>> only need to write it correctly once.
>
> One thing I'd like to use the centralized handling for is to track such
> writes in pg_stat_io. I don't mean as part of the initial patch, just that
> that's another reason I like the facility.

Oh, I didn't realize they're not counted at the moment.

>> + bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
>> +
>> nblocks = smgrnblocks(src, forkNum);
>>
>> for (blkno = 0; blkno < nblocks; blkno++)
>> {
>> + Page page;
>> +
>> /* If we got a cancel signal during the copy of the data, quit */
>> CHECK_FOR_INTERRUPTS();
>>
>> - smgrread(src, forkNum, blkno, buf.data);
>> + page = bulkw_alloc_buf(bulkw);
>> + smgrread(src, forkNum, blkno, page);
>>
>> if (!PageIsVerifiedExtended(page, blkno,
>> PIV_LOG_WARNING | PIV_REPORT_STAT))
>> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
>> * page this is, so we have to log the full page including any unused
>> * space.
>> */
>> - if (use_wal)
>> - log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
>> -
>> - PageSetChecksumInplace(page, blkno);
>> -
>> - /*
>> - * Now write the page. We say skipFsync = true because there's no
>> - * need for smgr to schedule an fsync for this write; we'll do it
>> - * ourselves below.
>> - */
>> - smgrextend(dst, forkNum, blkno, buf.data, true);
>> + bulkw_write(bulkw, blkno, page, false);
>
> I wonder if bulkw_alloc_buf() is a good name - if you naively read this
> change, it looks like it'll just leak memory. It also might be taken to be
> valid until freed, which I don't think is the case?

Yeah, I'm not very happy with this interface. The model is that you get
a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
over to bulkw_write(), which takes ownership of it and frees it later.
There is no other function to free it, although currently the buffer is
just palloc'd, so you could call pfree on it.
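
To spell out the ownership rules in code (function names as of the
version quoted above):

    page = bulkw_alloc_buf(bulkw);          /* bulk_write.c allocates; caller borrows */
    /* ... fill in the page contents ... */
    bulkw_write(bulkw, blkno, page, false); /* ownership passes back to bulk_write.c */
    /* 'page' must not be used, or pfree'd, after this point */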

However, I'd like to not expose that detail to the callers. I'm
imagining that in the future we might optimize further by having a
larger buffer, e.g. 1 MB, and carving the 8 kB blocks from that. Then,
opportunistically, if you fill the buffers sequentially, bulk_write.c
could do one smgrextend() to write the whole 1 MB chunk.
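
A very rough sketch of what I mean, with made-up names and a hardcoded
chunk size, just to illustrate the carving:

    #define CHUNK_BLOCKS (1024 * 1024 / BLCKSZ) /* 128 blocks per 1 MB chunk */

    static char chunk[1024 * 1024];
    static int  chunk_next = 0;

    /* hypothetical: how bulkw_get_buf() could carve blocks out of the chunk */
    static Page
    carve_next_block(void)
    {
        /*
         * (Flushing the full chunk with a single smgr call before wrapping
         * around is omitted here.)
         */
        return (Page) (chunk + (chunk_next++ % CHUNK_BLOCKS) * BLCKSZ);
    }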

I renamed it to bulkw_get_buf() now, and made it return a new
BulkWriteBuffer typedef instead of a plain Page. The point of the new
typedef is to distinguish a buffer returned by bulkw_get_buf() from a
Page or char[BLCKSZ] that you might palloc on your own. That indeed
revealed some latent bugs in gistbuild.c, where I had mixed up buffers
returned by bulkw_alloc_buf() and palloc'd buffers.

(The previous version of this patch called a different struct
BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be
confused!)
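
The shape of it, roughly (the exact definition is in the patch):

    typedef struct BulkWriteBuffer *BulkWriteBuffer; /* distinct pointer type */

    BulkWriteBuffer buf = bulkw_get_buf(bulkw);  /* fine */
    Page            page = (Page) palloc(BLCKSZ);
    bulkw_write(bulkw, blkno, page, false);      /* incompatible pointer type,
                                                  * caught by the compiler */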

I think this helps a little, but I'm still not very happy with it. I'll
give it some more thought after sleeping over it, but in the meanwhile,
I'm all ears for suggestions.

>> +/*-------------------------------------------------------------------------
>> + *
>> + * bulk_write.c
>> + * Efficiently and reliably populate a new relation
>> + *
>> + * The assumption is that no other backends access the relation while we are
>> + * loading it, so we can take some shortcuts. Alternatively, you can use the
>> + * buffer manager as usual, if performance is not critical, but you must not
>> + * mix operations through the buffer manager and the bulk loading interface at
>> + * the same time.
>
> From "Alternatively" onward this is is somewhat confusing.

Rewrote that to just "Do not mix operations through the regular buffer
manager and the bulk loading interface!"

>> + * One tricky point is that because we bypass the buffer manager, we need to
>> + * register the relation for fsyncing at the next checkpoint ourselves, and
>> + * make sure that the relation is correctly fsync by us or the checkpointer
>> + * even if a checkpoint happens concurrently.
>
> "fsync'ed" or such? Otherwise this reads awkwardly for me.

Yep, fixed.

>> +typedef struct BulkWriteBuffer
>> +{
>> + Page page;
>> + BlockNumber blkno;
>> + bool page_std;
>> + int16 order;
>> +} BulkWriteBuffer;
>> +
>
> The name makes it sound like this struct itself contains a buffer - but it's
> just pointing to one. *BufferRef or such maybe?
>
> I was wondering how you dealt with the alignment of buffers given the struct
> definition, which is what lead me to look at this...

I renamed this to PendingWrite, and the array that holds these to
"pending_writes". Think of it as a queue of writes that haven't been
performed yet.
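
So, modulo details, the struct quoted above becomes:

    typedef struct PendingWrite
    {
        BulkWriteBuffer buf;    /* was "Page page"; see the new typedef above */
        BlockNumber     blkno;
        bool            page_std;
        int16           order;
    } PendingWrite;

    /* ... and in BulkWriteState, the array becomes: */
    PendingWrite    pending_writes[MAX_BUFFERED_PAGES];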

>> +/*
>> + * Bulk writer state for one relation fork.
>> + */
>> +typedef struct BulkWriteState
>> +{
>> + /* Information about the target relation we're writing */
>> + SMgrRelation smgr;
>
> Isn't there a danger of this becoming a dangling pointer? At least until
> https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
> is merged?

Good point. I just added a FIXME comment to remind about that, hoping
that that patch gets merged soon. If not, I'll come up with a different fix.

>> + ForkNumber forknum;
>> + bool use_wal;
>> +
>> + /* We keep several pages buffered, and WAL-log them in batches */
>> + int nbuffered;
>> + BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
>> +
>> + /* Current size of the relation */
>> + BlockNumber pages_written;
>> +
>> + /* The RedoRecPtr at the time that the bulk operation started */
>> + XLogRecPtr start_RedoRecPtr;
>> +
>> + Page zeropage; /* workspace for filling zeroes */
>
> We really should just have one such page in shared memory somewhere... For WAL
> writes as well.
>
> But until then - why do you allocate the page? Seems like we could just use a
> static global variable?

I made it a static global variable for now. (The palloc way was copied
over from nbtsort.c)
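
I.e. something like this (using PGIOAlignedBlock to keep the buffer
I/O-aligned):

    /*
     * Static storage is zero-initialized by C, so no explicit memset or
     * palloc0() is needed.
     */
    static const PGIOAlignedBlock zero_page = {{0}};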

>> +/*
>> + * Write all buffered pages to disk.
>> + */
>> +static void
>> +bulkw_flush(BulkWriteState *bulkw)
>> +{
>> + int nbuffered = bulkw->nbuffered;
>> + BulkWriteBuffer *buffers = bulkw->buffers;
>> +
>> + if (nbuffered == 0)
>> + return;
>> +
>> + if (nbuffered > 1)
>> + {
>> + int o;
>> +
>> + qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
>> +
>> + /*
>> + * Eliminate duplicates, keeping the last write of each block.
>> + * (buffer_cmp uses 'order' as the last sort key)
>> + */
>
> Huh - which use cases would actually cause duplicate writes?

Hmm, nothing anymore, I guess. Many AMs used to write zero pages as a
placeholder and come back to fill them in later, but now that
bulk_write.c handles that itself, there are no duplicate writes anymore.

Removed the duplicate elimination, and replaced it with an assertion in
buffer_cmp() that there are no duplicates.
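
The comparator then looks roughly like this:

    static int
    buffer_cmp(const void *a, const void *b)
    {
        const PendingWrite *wa = (const PendingWrite *) a;
        const PendingWrite *wb = (const PendingWrite *) b;

        /* There should never be two pending writes for the same block. */
        Assert(wa->blkno != wb->blkno);

        if (wa->blkno < wb->blkno)
            return -1;
        else
            return 1;
    }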

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
v3-0001-Introduce-a-new-bulk-loading-facility.patch text/x-patch 53.2 KB
