Re: Relation bulk write facility

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation bulk write facility
Date: 2023-11-19 00:04:16
Message-ID: 20231119000416.fuyrbo6av7dytmf6@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-11-17 11:37:21 +0100, Heikki Linnakangas wrote:
> The new facility makes it easier to optimize bulk loading, as the
> logic for buffering, WAL-logging, and syncing the relation only needs
> to be implemented once. It's also less error-prone: We have had a
> number of bugs in how a relation is fsync'd - or not - at the end of a
> bulk loading operation. By centralizing that logic to one place, we
> only need to write it correctly once.

One thing I'd like to use the centralized handling for is to track such
writes in pg_stat_io. I don't mean as part of the initial patch, just that
that's another reason I like the facility.

> The new facility is faster for small relations: Instead of calling
> smgrimmedsync(), we register the fsync to happen at next checkpoint,
> which avoids the fsync latency. That can make a big difference if you
> are e.g. restoring a schema-only dump with lots of relations.

I think this would be a huge win for people running their application tests
against postgres.

> + bulkw = bulkw_start_smgr(dst, forkNum, use_wal);
> +
> nblocks = smgrnblocks(src, forkNum);
>
> for (blkno = 0; blkno < nblocks; blkno++)
> {
> + Page page;
> +
> /* If we got a cancel signal during the copy of the data, quit */
> CHECK_FOR_INTERRUPTS();
>
> - smgrread(src, forkNum, blkno, buf.data);
> + page = bulkw_alloc_buf(bulkw);
> + smgrread(src, forkNum, blkno, page);
>
> if (!PageIsVerifiedExtended(page, blkno,
> PIV_LOG_WARNING | PIV_REPORT_STAT))
> @@ -511,30 +514,9 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
> * page this is, so we have to log the full page including any unused
> * space.
> */
> - if (use_wal)
> - log_newpage(&dst->smgr_rlocator.locator, forkNum, blkno, page, false);
> -
> - PageSetChecksumInplace(page, blkno);
> -
> - /*
> - * Now write the page. We say skipFsync = true because there's no
> - * need for smgr to schedule an fsync for this write; we'll do it
> - * ourselves below.
> - */
> - smgrextend(dst, forkNum, blkno, buf.data, true);
> + bulkw_write(bulkw, blkno, page, false);

I wonder if bulkw_alloc_buf() is a good name - if you naively read this
change, it looks like it'll just leak memory. It also might be taken to be
valid until freed, which I don't think is the case?
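To make the lifetime concern concrete, here is a toy model of the pattern (all names and the implementation are hypothetical, not the patch's code): the "allocated" buffer is really scratch space whose ownership returns to the writer on bulkw_write(), so nothing leaks and the pointer must not be kept afterwards.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 8192

/* Toy model of the lifetime question, not the patch's implementation. */
typedef struct BulkWriter
{
	long		pages_written;
} BulkWriter;

/*
 * Hand out scratch space for one page.  The pointer is only valid until
 * it is passed back via bulkw_write(); callers must not retain it.
 */
static char *
bulkw_alloc_buf(BulkWriter *bw)
{
	(void) bw;
	return calloc(1, PAGE_SIZE);
}

/*
 * Take the page back; the writer now owns it and frees (or batches) it
 * itself, so there is no leak despite the "alloc" in the name.
 */
static void
bulkw_write(BulkWriter *bw, char *page)
{
	free(page);					/* the real code would buffer it instead */
	bw->pages_written++;
}
```

A name like bulkw_get_buf() might signal that ownership round-trip better.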

> +/*-------------------------------------------------------------------------
> + *
> + * bulk_write.c
> + * Efficiently and reliably populate a new relation
> + *
> + * The assumption is that no other backends access the relation while we are
> + * loading it, so we can take some shortcuts. Alternatively, you can use the
> + * buffer manager as usual, if performance is not critical, but you must not
> + * mix operations through the buffer manager and the bulk loading interface at
> + * the same time.

From "Alternatively" onward, this is somewhat confusing.

> + * We bypass the buffer manager to avoid the locking overhead, and call
> + * smgrextend() directly. A downside is that the pages will need to be
> + * re-read into shared buffers on first use after the build finishes. That's
> + * usually a good tradeoff for large relations, and for small relations, the
> + * overhead isn't very significant compared to creating the relation in the
> + * first place.

FWIW, I doubt the "isn't very significant" bit is really true.

> + * One tricky point is that because we bypass the buffer manager, we need to
> + * register the relation for fsyncing at the next checkpoint ourselves, and
> + * make sure that the relation is correctly fsync by us or the checkpointer
> + * even if a checkpoint happens concurrently.

"fsync'ed" or such? Otherwise this reads awkwardly for me.

> + *
> + *
> + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + *
> + * IDENTIFICATION
> + * src/backend/storage/smgr/bulk_write.c
> + *
> + *-------------------------------------------------------------------------
> + */
> +#include "postgres.h"
> +
> +#include "access/xloginsert.h"
> +#include "access/xlogrecord.h"
> +#include "storage/bufmgr.h"
> +#include "storage/bufpage.h"
> +#include "storage/bulk_write.h"
> +#include "storage/proc.h"
> +#include "storage/smgr.h"
> +#include "utils/rel.h"
> +
> +#define MAX_BUFFERED_PAGES XLR_MAX_BLOCK_ID
> +
> +typedef struct BulkWriteBuffer
> +{
> + Page page;
> + BlockNumber blkno;
> + bool page_std;
> + int16 order;
> +} BulkWriteBuffer;
> +

The name makes it sound like this struct itself contains a buffer - but it's
just pointing to one. *BufferRef or such maybe?

I was wondering how you dealt with the alignment of buffers given the struct
definition, which is what led me to look at this...
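One way to sidestep the struct-alignment question (a sketch, with assumed constants; PostgreSQL proper would presumably use its own aligned-allocation helpers) is to keep only a pointer in the struct and allocate the page itself with I/O alignment:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLCKSZ			8192
#define IO_ALIGN_SIZE	4096	/* assumed direct-I/O alignment requirement */

/*
 * Allocate one page buffer aligned for direct I/O.  The bookkeeping
 * struct then holds just this pointer plus metadata, so the struct
 * itself needs no special alignment.
 */
static void *
alloc_io_aligned_page(void)
{
	void	   *p;

	if (posix_memalign(&p, IO_ALIGN_SIZE, BLCKSZ) != 0)
		return NULL;
	return p;
}
```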

> +/*
> + * Bulk writer state for one relation fork.
> + */
> +typedef struct BulkWriteState
> +{
> + /* Information about the target relation we're writing */
> + SMgrRelation smgr;

Isn't there a danger of this becoming a dangling pointer? At least until
https://postgr.es/m/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA%40mail.gmail.com
is merged?

> + ForkNumber forknum;
> + bool use_wal;
> +
> + /* We keep several pages buffered, and WAL-log them in batches */
> + int nbuffered;
> + BulkWriteBuffer buffers[MAX_BUFFERED_PAGES];
> +
> + /* Current size of the relation */
> + BlockNumber pages_written;
> +
> + /* The RedoRecPtr at the time that the bulk operation started */
> + XLogRecPtr start_RedoRecPtr;
> +
> + Page zeropage; /* workspace for filling zeroes */

We really should just have one such page in shared memory somewhere... For WAL
writes as well.

But until then - why do you allocate the page? Seems like we could just use a
static global variable?
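A static variable version could look like this sketch (constants assumed; C's zero-initialization rules make the page all-zeroes with no startup cost, and C11 _Alignas covers any direct-I/O alignment requirement):

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ			8192
#define IO_ALIGN_SIZE	4096	/* assumed direct-I/O alignment requirement */

/*
 * A static zero page lives in the BSS: zero-initialized by the C
 * runtime, shared by all callers, never allocated or freed.
 */
static _Alignas(IO_ALIGN_SIZE) const char zero_page[BLCKSZ];

/* Return the shared zero page for use when extending a relation. */
static const char *
get_zero_page(void)
{
	return zero_page;
}
```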

> +/*
> + * Write all buffered pages to disk.
> + */
> +static void
> +bulkw_flush(BulkWriteState *bulkw)
> +{
> + int nbuffered = bulkw->nbuffered;
> + BulkWriteBuffer *buffers = bulkw->buffers;
> +
> + if (nbuffered == 0)
> + return;
> +
> + if (nbuffered > 1)
> + {
> + int o;
> +
> + qsort(buffers, nbuffered, sizeof(BulkWriteBuffer), buffer_cmp);
> +
> + /*
> + * Eliminate duplicates, keeping the last write of each block.
> + * (buffer_cmp uses 'order' as the last sort key)
> + */

Huh - which use cases would actually cause duplicate writes?
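For reference, the sort-then-dedupe step being discussed can be modeled like this (a toy sketch with simplified types; field names mirror the patch, the rest is assumed): sorting by block number with submission order as tie-breaker means that, within a run of equal block numbers, the last element is the latest write, so it wins.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of one buffered write: target block plus submission order. */
typedef struct
{
	unsigned	blkno;
	int			order;			/* submission order, the tie-breaker */
} BufferedWrite;

/* Sort by block number, then by submission order. */
static int
buffer_cmp(const void *a, const void *b)
{
	const BufferedWrite *ba = a;
	const BufferedWrite *bb = b;

	if (ba->blkno != bb->blkno)
		return ba->blkno < bb->blkno ? -1 : 1;
	return ba->order - bb->order;
}

/*
 * Sort and deduplicate in place, keeping the last write to each block;
 * returns the new count.
 */
static int
dedup_writes(BufferedWrite *bufs, int n)
{
	int			o = 0;

	qsort(bufs, n, sizeof(BufferedWrite), buffer_cmp);
	for (int i = 1; i < n; i++)
	{
		if (bufs[i].blkno == bufs[o].blkno)
			bufs[o] = bufs[i];	/* later write to the same block wins */
		else
			bufs[++o] = bufs[i];
	}
	return n > 0 ? o + 1 : 0;
}
```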

Greetings,

Andres Freund
