Re: refactoring relation extension and BufferAlloc(), faster COPY

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: refactoring relation extension and BufferAlloc(), faster COPY
Date: 2023-03-01 17:25:03
Lists: pgsql-hackers


On 2023-03-01 09:02:00 -0800, Andres Freund wrote:
> On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
> > On 27/02/2023 23:45, Andres Freund wrote:
> > > But, uh, isn't this code racy? Because this doesn't go through shared buffers,
> > > there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
> > > that writing pages isn't atomic vs readers. So another connection could
> > > see the new relation size, but a read might return a
> > > partially written state of the page. Which then would cause checksum
> > > failures. And even worse, I think it could lead to losing a write, if the
> > > concurrent connection writes out a page.
> >
> > fsm_readbuf and vm_readbuf check the relation size first, with
> > smgrnblocks(), before trying to read the page. So to have a problem, the
> > smgrnblocks() would have to already return the new size, but the smgrread()
> > would not return the new contents. I don't think that's possible, but not
> > sure.
>
> I hacked Thomas' program to test torn reads to ftruncate the file on the write
> side.
>
> It frequently observes a file size that's not the write size (e.g. reading 4k
> when writing an 8k block).
>
> After extending the test to more than one reader, I indeed also see torn
> reads. So far all the tears have been at a 4k block boundary. However, so far
> it always has been *prior* page contents, not 0s.

On tmpfs the failure rate is much higher, and we also end up reading 0s,
despite never writing them.

I've attached my version of the test program.

ext4: lots of 4k reads with 8k writes, some torn reads at 4k boundaries
xfs: no issues
tmpfs: loads of 4k reads with 8k writes, lots of torn reads reading 0s, some torn reads at 4k boundaries
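
For illustration only (this is not the attached concurrent-read-write.c, whose
contents are not reproduced here), a minimal sketch of how a reader could sort
each read into the categories listed above. It assumes the writer fills every
8k block with a single non-zero byte that changes on each write; BLOCK_SIZE,
fill_block() and classify_read() are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 8192			/* assumed write size (one 8k page) */

typedef enum
{
	READ_OK,					/* whole block carries one non-zero fill byte */
	READ_SHORT,					/* fewer than BLOCK_SIZE bytes came back */
	READ_TORN,					/* two different non-zero fill bytes mixed */
	READ_ZEROS					/* block contains zeros the writer never wrote */
} read_result;

/* Writer side (assumed pattern): fill the block with one non-zero byte. */
static void
fill_block(unsigned char *buf, unsigned char gen)
{
	assert(buf != NULL && gen != 0);
	memset(buf, gen, BLOCK_SIZE);
}

/*
 * Reader side: classify the result of one read of nread bytes.
 * *tear_at is set to the offset of the first byte that differs from
 * buf[0], or to 0 if no such byte exists.
 */
static read_result
classify_read(const unsigned char *buf, size_t nread, size_t *tear_at)
{
	assert(buf != NULL && tear_at != NULL);

	*tear_at = 0;
	if (nread < BLOCK_SIZE)
		return READ_SHORT;

	for (size_t i = 1; i < BLOCK_SIZE; i++)
	{
		if (buf[i] != buf[0])
		{
			*tear_at = i;
			/* a zero on either side of the tear was never written */
			return (buf[i] == 0 || buf[0] == 0) ? READ_ZEROS : READ_TORN;
		}
	}

	/* uniform block: all zeros is still something the writer never wrote */
	return buf[0] == 0 ? READ_ZEROS : READ_OK;
}
```

With such a fill pattern, a tear at a 4k boundary would show up as READ_TORN or
READ_ZEROS with *tear_at == 4096, and a 4k read against an 8k write as
READ_SHORT.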


Andres Freund

Attachment Content-Type Size
concurrent-read-write.c text/x-csrc 2.1 KB
