Re: Online checksums verification in the backend

From: Andres Freund <andres(at)anarazel(dot)de>
To: Julien Rouhaud <rjuju123(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Masahiko Sawada <masahiko(dot)sawada(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Subject: Re: Online checksums verification in the backend
Date: 2020-10-30 02:58:34
Message-ID: 20201030025834.6spoioldj3jiqif3@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:
> On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The code does IO while holding the buffer mapping lock. That seems
> > *entirely* unacceptable to me. That basically locks 1/128 of shared
> > buffers against concurrent mapping changes, while reading data that is
> > likely not to be on disk? Seriously?
>
> The initial implementation had a different approach, reading the buffer once
> without holding the buffer mapping lock (which could lead to some false
> positive in some unlikely scenario), and only if a corruption is detected the
> read is done once again *while holding the buffer mapping lock* to ensure it's
> not a false positive. Some benchmarking showed that the performance was worse,
> so we dropped that optimisation. Should we go back to something like that or
> do you have a better way to ensure a consistent read of a buffer which isn't in
> shared buffers?

I suspect that you're gonna need something quite different than what the
function is doing right now. Not because such a method will be faster in
isolation, but because there's a chance to have it correct and not have
a significant performance impact onto the rest of the system.

I've not thought about it in detail yet. Is suspect you'll need to
ensure there is a valid entry in the buffer mapping table for the buffer
you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
you're going to prevent concurrent IO from starting until your part is
done.

> > Also, why are pages without a valid tag ignored? I can follow the
> > argument for skipping it in the DIRTY case, but that doesn't apply for
> > BM_TAG_VALID?
>
> AFAICT pages that aren't BM_TAG_VALID are pages newly allocated.
> Those shouldn't
> be entirely initialized yet, and they'll be eventually written and flushed.

When a page is being read there's a period when the buffer is without
BM_TAG_VALID. It's quite possible that the locking prevents this case
from being reachable - but in that case you shouldn't just accept it as
something to be skipped...

> > Also, uh, I don't think the locking of the buffer table provides you
> > with the full guarantees CheckBuffer() seems to assume:
> >
> > * Check the state of a buffer without loading it into the shared buffers. To
> > * avoid torn pages and possible false positives when reading data, a shared
> > * LWLock is taken on the target buffer pool partition mapping, and we check
> > * if the page is in shared buffers or not. An I/O lock is taken on the block
> > * to prevent any concurrent activity from happening.
> >
> > this doesn't actually prevent all concurrent write IO, unless you hold
> > an appropriate lock on the relation. There's a few places that use
> > smgrwrite()/smgrextend() to write out data bypassing shared buffers.
> >
> > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
> > for, but that'd need a pretty detailed explanation as to when it's safe
> > to use CheckBuffer() for which blocks.
>
> AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
> relation, during creation.

That may be the case right in core right now, but for one, there
definitely are extensions going through smgrwrite() without using the
buffer pool. Essentially, what you are saying is that the introduction
of CheckBuffer() altered what smgrwrite() is allowed to be used for,
without having discussed or documented that.

Before this an AM/extension could just use smgrwrite() to write data not
in shared buffers, as long as a locking scheme is used that prevents
multiple backends from doing that at the same time (trivially:
AccessExclusiveLock).

> Those relations shouldn't be visible to the caller
> snapshot, so it should be safe. I can add a comment for that if I'm not
> mistaken.

There's no comment warning that you shouldn't use CheckBuffer() to check
every buffer in shared buffers, or every relfilenode on disk. The latter
would be quite a reasonable thing, given it'd avoid needing to connect
to every database etc.

> For concurrent smgrextend(), we read the relation size at the beginning of the
> function, so we shouldn't read newly allocated blocks. But you're right that
> it's still possible to get the size that includes a newly allocated block
> that can be concurrently written. We can avoid that be holding a
> LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok?

That could possibly work - but currently CheckBuffer() doesn't get a
relation, nor are the comments explaining that it has to be a relation
in the current database or anything.

I hadn't yet looked at the caller - I just started looking at
CheckBuffer() this because it caused compilation failures after rebasing
my aio branch onto master (there's no IO locks anymore).

Looking at the caller:
- This is not a concurrency safe pattern:

/* Check if relation exists. leaving if there is no such relation */
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
return;

relation = relation_open(relid, AccessShareLock);

there's a pretty obvious time-to-check-time-to-use danger here.

- pg_relation_check_pages()'s docs say "valid enough to safely be loaded
into the server's shared buffers". I think that's overpromising by a
lot. It sounds like it verifies that the page cannot cause a crash or
such when accessed - but it obviously does no such thing.

- Why does check_one_relation() *silently* ignore when it's being
passed a temporary table, or a relkind without storage?

- I don't think it's good that check_one_relation() releases relation
locks after access, but I know that others think that's fine (I think
it's only fine for catalog relations).

- I realize permission to pg_relation_check_pages() is not granted to
non-superusers by default, but shouldn't it still perform relation
access checks?

- why does check_relation_fork() pstrdup the path?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2020-10-30 03:00:27 Re: Add statistics to pg_stat_wal view for wal related parameter tuning
Previous Message Fujii Masao 2020-10-30 02:50:59 Re: Add statistics to pg_stat_wal view for wal related parameter tuning