Re: Online checksums verification in the backend

From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Julien Rouhaud <rjuju123(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-11-02 01:39:40
Message-ID: 20201102013940.tk5e2co2maedzksw@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I'm a bit limited writing - one handed for a while following an injury
on Friday...

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:
> On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote:
> > I think its pretty much *never* OK to do IO while holding a buffer
> > mapping lock. You're locking a significant fraction of shared buffers
> > over IO. That's just not OK. Don't think there's any place doing so
> > currently either.
>
> There is no place doing that on HEAD.

Err?

/* see if the block is in the buffer pool or not */
LWLockAcquire(partLock, LW_SHARED);
buf_id = BufTableLookup(&buf_tag, buf_hash);
if (buf_id >= 0)
{
...
/* Read the buffer from disk, with the I/O lock still held */
smgrread(smgr, forknum, blkno, buffer);
LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
}
else
{
/*
* Simply read the buffer. There's no risk of modification on it as
* we are holding the buffer pool partition mapping lock.
*/
smgrread(smgr, forknum, blkno, buffer);
}

/* buffer lookup done, so now do its check */
LWLockRelease(partLock);

How is this not doing IO while holding a buffer mapping lock?

> This specific point was mentioned in the first message of this thread,
> 7th paragraph. That was a long thread, so it is easy to miss:
> https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

The code clearly doesnt implement it that way.

> I am wondering what you have in mind regarding the use of
> BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some
> consequences for other existing buffers in the table, like a possible
> eviction?

You'd need exactly one empty buffer for that - it can be reused for the
next to-be-checked buffer.

> I'd like to think that we should not do any manipulation of
> the buffer tables in this case.

Why? Its the way we lock buffers - why is this so special that we need
to do differently?

> Hence, in order to prevent a
> concurrent activity to load in shared buffers the page currently
> checked on disk, I got to think that we would need something new here,
> like a filtering hash table that would be checked each time a backend
> tries to insert an entry into the buffer tables.

Thats going to slow down everything a bit - the mapping already is a
bottleneck.

> 1) If the buffer is in shared buffers, we have the APIs to solve that
> by using a pin, unlock the partition, and then do the I/O. (Still
> that's unsafe with the smgrwrite() argument?)

Thats why you need an appropriate relation lock... Something CheckBuffer
didnt bother to document. Its a restriction, but one we probably can
live with.

> 2) If the buffer is not in shared buffers, we don't have what it takes
> to solve the problem yet.

We do. Set up enough state for the case to be otherwise the same as the
in s_b case.

> But even if we solve this problem, we will
> never really be sure that this is entirely safe, as per the argument
> with concurrent smgrwrite() calls. Current in-core code assumes that
> this can happen only for init forks of unlogged relations which would
> not be visible yet in the backend doing a page check, still it can be
> really easy to break this assumption with any new code added by a new
> feature.

It also happens in a few other cases than just init forks. But
visibility & relation locking can take care of that. But you need to
document that. If the locking allows concurent readers - and especially
concurrent writers, then you cant really use smgrwite for anything but
relation extension.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2020-11-02 01:45:00 Re: Online checksums verification in the backend
Previous Message Kyotaro Horiguchi 2020-11-02 01:36:10 Re: Explicit NULL dereference (src/backend/utils/adt/ruleutils.c)