Re: Relation bulk write facility

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation bulk write facility
Date: 2024-01-09 06:50:45
Message-ID: CALDaNm0WYTR4mnvb8nRCWBinFw1kP9x0aSqmdswHj_Nr50YQug@mail.gmail.com
Lists: pgsql-hackers

On Sat, 25 Nov 2023 at 06:49, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> On 19/11/2023 02:04, Andres Freund wrote:
> > On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> >> The new facility makes it easier to optimize bulk loading, as the
> >> logic for buffering, WAL-logging, and syncing the relation only needs
> >> to be implemented once. It's also less error-prone: We have had a
> >> number of bugs in how a relation is fsync'd - or not - at the end of a
> >> bulk loading operation. By centralizing that logic to one place, we
> >> only need to write it correctly once.
> >
> > One thing I'd like to use the centralized handling for is to track such
> > writes in pg_stat_io. I don't mean as part of the initial patch, just that
> > that's another reason I like the facility.
>
> Oh I didn't realize they're not counted at the moment.
>
> >> + bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> >> +
> >> nblocks = smgrnblocks(src, forkNum);
> >>
> >> for (blkno = 0; blkno < nblocks; blkno++)
> >> {
> >> + Page page;
> >> +
> >> /* If we got a cancel signal during the copy of the data, quit */
> >> CHECK_FOR_INTERRUPTS();
> >>
> >> - smgrread(src, forkNum, blkno, buf.data);
> >> + page = bulkw_alloc_buf(bulkw);
> >> + smgrread(src, forkNum, blkno, page);
> >>
> >> if (!PageIsVerifiedExtended(page, blkno,
> >> PIV_LOG_WARNING | PIV_REPORT_STAT))
> >> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
> >> * page this is, so we have to log the full page including any unused
> >> * space.
> >> */
> >> - if (use_wal)
> >> - log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> >> -
> >> - PageSetChecksumInplace(page, blkno);
> >> -
> >> - /*
> >> - * Now write the page. We say skipFsync = true because there's no
> >> - * need for smgr to schedule an fsync for this write; we'll do it
> >> - * ourselves below.
> >> - */
> >> - smgrextend(dst, forkNum, blkno, buf.data, true);
> >> + bulkw_write(bulkw, blkno, page, false);
> >
> > I wonder if bulkw_alloc_buf() is a good name - if you naively read this
> > change, it looks like it'll just leak memory. It also might be taken to be
> > valid until freed, which I don't think is the case?
>
> Yeah, I'm not very happy with this interface. The model is that you get
> a buffer to write to by calling bulkw_alloc_buf(). Later, you hand it
> over to bulkw_write(), which takes ownership of it and frees it later.
> There is no other function to free it, although currently the buffer is
> just palloc'd so you could call pfree on it.
>
> However, I'd like to not expose that detail to the callers. I'm
> imagining that in the future we might optimize further, by having a
> larger e.g. 1 MB buffer, and carve the 8kB blocks from that. Then
> opportunistically, if you fill the buffers sequentially, bulk_write.c
> could do one smgrextend() to write the whole 1 MB chunk.
>
> I renamed it to bulkw_get_buf() now, and made it return a new
> BulkWriteBuffer typedef instead of a plain Page. The point of the new
> typedef is to distinguish a buffer returned by bulkw_get_buf() from a
> Page or char[BLCKSZ] that you might palloc on your own. That indeed
> revealed some latent bugs in gistbuild.c where I had mixed up buffers
> returned by bulkw_alloc_buf() and palloc'd buffers.
>
> (The previous version of this patch called a different struct
> BulkWriteBuffer, but I renamed that to PendingWrite; see below. Don't be
> confused!)
>
> I think this helps a little, but I'm still not very happy with it. I'll
> give it some more thought after sleeping over it, but in the meanwhile,
> I'm all ears for suggestions.
>
> >> +/*-------------------------------------------------------------------------
> >> + *
> >> + * bulk_write.c
> >> + * Efficiently and reliably populate a new relation
> >> + *
> >> + * The assumption is that no other backends access the relation while we are
> >> + * loading it, so we can take some shortcuts. Alternatively, you can use the
> >> + * buffer manager as usual, if performance is not critical, but you must not
> >> + * mix operations through the buffer manager and the bulk loading interface at
> >> + * the same time.
> >
> > From "Alternatively" onward this is somewhat confusing.
>
> Rewrote that to just "Do not mix operations through the regular buffer
> manager and the bulk loading interface!"
>
> >> + * One tricky point is that because we bypass the buffer manager, we need to
> >> + * register the relation for fsyncing at the next checkpoint ourselves, and
> >> + * make sure that the relation is correctly fsync by us or the checkpointer
> >> + * even if a checkpoint happens concurrently.
> >
> > "fsync'ed" or such? Otherwise this reads awkwardly for me.
>
> Yep, fixed.
>
> >> +typedef struct BulkWriteBuffer
> >> +{
> >> + Page page;
> >> + BlockNumber blkno;
> >> + bool page_std;
> >> + int16 order;
> >> +} BulkWriteBuffer;
> >> +
> >
> > The name makes it sound like this struct itself contains a buffer - but it's
> > just pointing to one. *BufferRef or such maybe?
> >
> > I was wondering how you dealt with the alignment of buffers given the struct
> > definition, which is what lead me to look at this...
>
> I renamed this to PendingWrite, and the field that holds these to
> "pending_writes". Think of it as a queue of writes that haven't been
> performed yet.
>
> >> +/*
> >> + * Bulk writer state for one relation fork.
> >> + */
> >> +typedef struct BulkWriteState
> >> +{
> >> + /* Information about the target relation we're writing */
> >> + SMgrRelation smgr;
> >
> > Isn't there a danger of this becoming a dangling pointer? At least until
> > https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
> > is merged?
>
> Good point. I just added a FIXME comment to remind about that, hoping
> that that patch gets merged soon. If not, I'll come up with a different fix.
>
> >> + ForkNumber forknum;
> >> + bool use_wal;
> >> +
> >> + /* We keep several pages buffered, and WAL-log them in batches */
> >> + int nbuffered;
> >> + BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> >> +
> >> + /* Current size of the relation */
> >> + BlockNumber pages_written;
> >> +
> >> + /* The RedoRecPtr at the time that the bulk operation started */
> >> + XLogRecPtr start_RedoRecPtr;
> >> +
> >> + Page zeropage; /* workspace for filling zeroes */
> >
> > We really should just have one such page in shared memory somewhere... For WAL
> > writes as well.
> >
> > But until then - why do you allocate the page? Seems like we could just use a
> > static global variable?
>
> I made it a static global variable for now. (The palloc way was copied
> over from nbtsort.c)
>
> >> +/*
> >> + * Write all buffered pages to disk.
> >> + */
> >> +static void
> >> +bulkw_flush(BulkWriteState *bulkw)
> >> +{
> >> + int nbuffered = bulkw->nbuffered;
> >> + BulkWriteBuffer *buffers = bulkw->buffers;
> >> +
> >> + if (nbuffered == 0)
> >> + return;
> >> +
> >> + if (nbuffered > 1)
> >> + {
> >> + int o;
> >> +
> >> + qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> >> +
> >> + /*
> >> + * Eliminate duplicates, keeping the last write of each block.
> >> + * (buffer_cmp uses 'order' as the last sort key)
> >> + */
> >
> > Huh - which use cases would actually cause duplicate writes?
>
> Hmm, nothing anymore, I guess. Many AMs used to write zero pages as a
> placeholder and come back to fill them in later, but now that
> bulk_write.c handles that, duplicate writes shouldn't occur anymore.
>
> Removed that, and replaced it with an assertion in buffer_cmp()
> that there are no duplicates.

There are a few compilation errors reported by CFBot at [1]; the patch
needs to be rebased:
[02:38:12.675] In file included from ../../../../src/include/postgres.h:45,
[02:38:12.675] from nbtsort.c:41:
[02:38:12.675] nbtsort.c: In function ‘_bt_load’:
[02:38:12.675] nbtsort.c:1309:57: error: ‘BTPageState’ has no member named ‘btps_page’
[02:38:12.675] 1309 | Assert(dstate->maxpostingsize <= BTMaxItemSize(state->btps_page) &&
[02:38:12.675] | ^~
[02:38:12.675] ../../../../src/include/c.h:864:9: note: in definition of macro ‘Assert’
[02:38:12.675] 864 | if (!(condition)) \
[02:38:12.675] | ^~~~~~~~~
[02:38:12.675] ../../../../src/include/c.h:812:29: note: in expansion of macro ‘TYPEALIGN_DOWN’
[02:38:12.675] 812 | #define MAXALIGN_DOWN(LEN) TYPEALIGN_DOWN(MAXIMUM_ALIGNOF, (LEN))
[02:38:12.675] | ^~~~~~~~~~~~~~
[02:38:12.675] ../../../../src/include/access/nbtree.h:165:3: note: in expansion of macro ‘MAXALIGN_DOWN’
[02:38:12.675] 165 | (MAXALIGN_DOWN((PageGetPageSize(page) - \

[1] - https://cirrus-ci.com/task/5299954164432896

Regards,
Vignesh
