Re: refactoring relation extension and BufferAlloc(), faster COPY

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY
Date: 2023-02-27 21:45:30
Message-ID: 20230227214530.vfe4gq5oipmsa3zu@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
> We also do this in freespace.c and visibilitymap.c:
>
> /* Extend as needed. */
> while (fsm_nblocks_now < fsm_nblocks)
> {
> PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
>
> smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
> pg.data, false);
> fsm_nblocks_now++;
> }
>
> We could use the new smgrzeroextend function here. But it would be better to
> go through the buffer cache, because after this, the last block, at
> 'fsm_nblocks', will be read with ReadBuffer() and modified.

I doubt it's a particularly crucial thing to optimize.

But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
connection could see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to loosing a write, if the
concurrent connection writes out a page.

> We could use BulkExtendSharedRelationBuffered() to extend the relation and
> keep the last page locked, but the BulkExtendSharedRelationBuffered()
> signature doesn't allow that. It can return the first N pages locked, but
> there's no way to return the *last* page locked.

We can't rely on bulk extending a, potentially, large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring locking the last page is a really viable API.

I think for this case I'd just just use the ExtendRelationTo() API we were
discussing nearby. Compared to the cost of reducing syscalls / filesystem
overhead to extend the relation, the cost of the buffer mapping lookup does't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.

I had started hacking on ExtendRelationTo() that when I saw problems with the
existing code that made me hesitate:
https://postgr.es/m/20230223010147.32oir7sb66slqnjk%40awork3.anarazel.de

> Perhaps we should decompose this function into several function calls.
> Something like:
>
> /* get N victim buffers, pinned and !BM_VALID */
> buffers = BeginExtendRelation(int npages);
>
> LockRelationForExtension(rel)
>
> /* Insert buffers into buffer table */
> first_blk = smgrnblocks()
> for (blk = first_blk; blk < last_blk; blk++)
> MapNewBuffer(blk, buffers[i])
>
> /* extend the file on disk */
> smgrzeroextend();
>
> UnlockRelationForExtension(rel)
>
> for (blk = first_blk; blk < last_blk; blk++)
> {
> memset(BufferGetPage(buffers[i]), 0,
> FinishNewBuffer(buffers[i])
> /* optionally lock the buffer */
> LockBuffer(buffers[i]);
> }
>
> That's a lot more verbose, of course, but gives the callers the flexibility.
> And might even be more readable than one function call with lots of
> arguments.

To me this seems like a quite bad idea. The amount of complexity this would
expose all over the tree is substantial. Which would also make it harder to
further improve relation extension at a later date. It certainly shouldn't be
the default interface. And I'm not sure I see a promisung usecase.

> This would expose the concept of a buffer that's mapped but marked as
> IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details
> that shouldn't be exposed. On the other hand, it might come handy. Instead
> of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function
> that returns an IO-in-progress buffer that you can initialize any way you
> want.

I'd much rather encapsulate that in additional functions, or perhaps a
callback that can make decisions about what to do.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Juan José Santamaría Flecha 2023-02-27 22:05:23 Re: WIN32 pg_import_system_collations
Previous Message Peter Smith 2023-02-27 21:21:38 Re: PGDOCS - sgml linkend using single-quotes