Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>, Ivan Kartyshov <i(dot)kartyshov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
Date: 2018-08-21 13:10:44
Message-ID: CAPpHfdtD3U2DpGZQJNe21s9s1s-Va7NRNcP1isvdCuJxzYypcg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 17, 2018 at 9:55 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> But then you are injecting bad pages into the shared buffer arena.
> In any case, depending on that behavior seems like a bad idea, because
> it's a pretty questionable kluge in itself.
>
> Another point is that the truncation code attempts to remove all
> to-be-truncated-away pages from the shared buffer arena, but that only
> works if nobody else is loading such pages into shared buffers
> concurrently. In the presence of concurrent scans, we might be left
> with valid-looking buffers for pages that have been truncated away
> on-disk. That could cause all sorts of fun later. Yeah, the buffers
> should contain only dead tuples ... but, for example, they might not
> be hinted dead. If somebody sets one of those hint bits and then
> writes the buffer back out to disk, you've got real problems.

I don't yet see how that's possible... count_nondeletable_pages()
ensures that every to-be-truncated-away page doesn't contain any used
item. From reading [1] I gather that we might even have live tuples if
truncation failed. But if truncation didn't fail, we shouldn't get this
hint problem. Please correct me if I'm wrong.

After reading [1] and [2] I see that there are at least 3 different
issues with heap truncation:
1) Data corruption on file truncation error (explained in [1]).
2) Expensive scanning of the whole shared buffers before file truncation.
3) Cancellation of read-only queries on standby even if hot_standby_feedback
is on, caused by replication of AccessExclusiveLock.

It seems that fixing any of those issues requires a redesign of heap
truncation. So, ideally, a redesign of heap truncation should fix all of
the issues above. Or at least it should be understood how the rest of the
issues can be fixed later using the new design.

I would like to share some of my sketchy thoughts about a new heap
truncation design. Let's imagine we introduced a dirty_barrier buffer
flag, which prevents a dirty buffer from being written out (and
correspondingly from being evicted). Then the truncation algorithm could
look like this (a toy C sketch follows the steps below):

1) Acquire ExclusiveLock on the relation.
2) Calculate the truncation point using count_nondeletable_pages(), while
simultaneously placing the dirty_barrier flag on dirty buffers and saving
their numbers to an array. Assuming no writes are being performed
concurrently, no to-be-truncated-away pages should be written out from
this point on.
3) Truncate the data files.
4) Iterate over the buffers past the truncation point and clear the dirty
and dirty_barrier flags from them (using the numbers we saved to the array
in step #2).
5) Release the relation lock.
*) If an exception happens after step #2, iterate over the buffers past
the truncation point and clear the dirty_barrier flags from them (using
the numbers we saved to the array in step #2).
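
To make the control flow concrete, here is a toy, self-contained C model
of these steps. It is not PostgreSQL code: the buffer pool, the
dirty_barrier flag and all names (ToyBuffer, set_barriers, ...) are
stand-ins, and real code would of course have to deal with buffer header
locking, WAL and the background writer.

/*
 * Toy model of steps #2-#5 and the exception path.  Illustration only.
 */
#include <stdbool.h>
#include <stdio.h>

#define NBUFFERS 16

typedef struct ToyBuffer
{
    int  blockno;  /* relation block cached by this buffer */
    bool dirty;    /* has unwritten changes */
    bool barrier;  /* hypothetical dirty_barrier flag: forbid write/evict */
} ToyBuffer;

static ToyBuffer buffers[NBUFFERS];

/* Step #2: flag dirty buffers past the truncation point, remember them. */
static int
set_barriers(int truncation_point, int *saved)
{
    int nsaved = 0;

    for (int i = 0; i < NBUFFERS; i++)
    {
        if (buffers[i].blockno >= truncation_point && buffers[i].dirty)
        {
            buffers[i].barrier = true;
            saved[nsaved++] = i;
        }
    }
    return nsaved;
}

/* Writer side: a barrier-flagged buffer must not be written out or evicted. */
static bool
may_write_buffer(const ToyBuffer *buf)
{
    return buf->dirty && !buf->barrier;
}

/* Steps #3-#5 plus the exception path (*).  The actual file truncation of
 * step #3 is represented only by the truncation_succeeded argument. */
static void
truncate_relation(int truncation_point, bool truncation_succeeded)
{
    int saved[NBUFFERS];
    int nsaved = set_barriers(truncation_point, saved);

    for (int i = 0; i < nsaved; i++)
    {
        buffers[saved[i]].barrier = false;
        if (truncation_succeeded)
            buffers[saved[i]].dirty = false; /* step #4 */
        /* on failure only the barrier is cleared; the dirty data is kept */
    }
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
    {
        buffers[i].blockno = i;
        buffers[i].dirty = (i >= 10);   /* pretend blocks >= 10 are dirty */
    }

    truncate_relation(10, true);

    for (int i = 0; i < NBUFFERS; i++)
        printf("block %d: dirty=%d barrier=%d writable=%d\n",
               buffers[i].blockno, buffers[i].dirty, buffers[i].barrier,
               may_write_buffer(&buffers[i]));
    return 0;
}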

After heap truncation using this algorithm, shared buffers may contain
past-EOF buffers. But those buffers are empty (no used items) and clean.
So, read-only queries shouldn't dirty those buffers by setting hint bits,
because there are no used items. Normally, these buffers will just be
evicted from the shared buffer arena. If relation extension happens
shortly after heap truncation, then some of those buffers could be found
again after the extension. I think this situation could be handled. For
instance, we can teach vacuum to claim a page as new once all its tuples
are gone.
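
As an illustration of the "claim page as new" idea (this is only my guess
at its shape, not existing vacuumlazy.c behaviour, and it ignores
WAL-logging), vacuum's per-page processing could do something like:

    Page        page = BufferGetPage(buf);

    if (PageIsEmpty(page))
    {
        /*
         * No used items left: zero the page so that PageIsNew() becomes
         * true and a stale cached copy seen after a later relation
         * extension is indistinguishable from a freshly extended page.
         */
        MemSet(page, 0, BLCKSZ);
        MarkBufferDirty(buf);
    }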

We're taking only an ExclusiveLock here. And assuming we teach our scans
to treat a page-past-EOF situation as no-visible-tuples-found, read-only
queries will run concurrently with heap truncation. Also, we don't have
to scan the whole shared buffers: only the buffers past the truncation
point are scanned at step #2. Later, flags are cleared only from the
dirty buffers past the truncation point. Data corruption on truncation
error also shouldn't happen, because we don't throw away any dirty
buffers before ensuring that the data files were successfully truncated.
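
For the scan side, the page-past-EOF check might look roughly like this
(again just an illustration; I haven't looked at where exactly in
heapgetpage() it would have to go, and "blkno" stands for the requested
block number):

    /*
     * A concurrent truncation may have removed this block from disk;
     * treat it as a page with no visible tuples instead of an error.
     */
    if (blkno >= RelationGetNumberOfBlocks(scan->rs_rd))
    {
        scan->rs_ntuples = 0;
        return;
    }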

The problem I see with this approach so far is that placing too many
dirty_barrier flags can affect concurrent activity. In order to cope with
that we may, for instance, truncate the relation in multiple iterations
when we find too many dirty buffers past the truncation point.
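
The multi-iteration variant could have roughly this shape
(pick_slice_start() and truncate_slice() are made-up helpers, and
max_barrier_buffers a made-up knob, just to show the loop structure):

    BlockNumber cur_end = old_nblocks;

    while (cur_end > new_nblocks)
    {
        /*
         * Choose a slice small enough to keep the number of
         * barrier-flagged dirty buffers below max_barrier_buffers.
         */
        BlockNumber slice_start = pick_slice_start(cur_end, new_nblocks,
                                                   max_barrier_buffers);

        /* run steps #2-#4 from above on [slice_start, cur_end) only */
        truncate_slice(slice_start, cur_end);
        cur_end = slice_start;
    }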

Any thoughts?

Links.
1. https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC%40tunaPC
2. https://www.postgresql.org/message-id/flat/CAHGQGwE5UqFqSq1%3DkV3QtTUtXphTdyHA-8rAj4A%3DY%2Be4kyp3BQ%40mail.gmail.com

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
