RE: Recovery performance of standby for multiple concurrent truncates on large tables

From: "Jamison, Kirk" <k(dot)jamison(at)jp(dot)fujitsu(dot)com>
To: 'Thomas Munro' <thomas(dot)munro(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)jp(dot)fujitsu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Recovery performance of standby for multiple concurrent truncates on large tables
Date: 2018-08-03 02:52:35
Message-ID: D09B13F772D2274BB348A310EE3027C6326856@g01jpexmbkw24
Lists: pgsql-hackers

Hi,
I appreciate the feedback and suggestions.

On Tue, Jul 31, 2018 at 8:01 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> How would this work if a relfilenode number that belonged to an old
>> relation got recycled for a new relation?
>> ..
>> I think something like this could be made to work -- both on the
>> master and the standby, and not just while waiting for a failover --
>> ...
>> It's not clear to me whether it would be worth the overhead of doing
>> something like this. Making relation drops faster at the cost of
>> making buffer cleaning slower could be a loser.

The deferred list holds information such as the relfilenode and the head (first)
block number of the invalidated pages. While being promoted, the standby only
registers entries in the deferred list when redoing a COMMIT record. It then
scans the shared buffer cache just once before promotion completes and discards
the pages belonging to the relations in the deferred list. After a page is
removed, the head block number of the corresponding entry is advanced.
In this way, if an INSERT were to progress during promotion (I am not sure
whether that is possible), the discard would progress along with it, and the
application could still refer to the relation.
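To make that concrete, here is a minimal sketch of what one deferred-list entry
and the single buffer-pool pass might look like. The names and layout are mine
(hypothetical), just to illustrate the flow described above, not code from a patch:

/*
 * Hypothetical sketch only -- names and layout are mine, not from a patch.
 * Redoing a COMMIT record appends one entry per dropped relation; a single
 * pass over shared buffers before promotion completes discards every page of
 * a listed relfilenode and advances head_blkno so the progress of the discard
 * is visible.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

typedef struct DeferredDropEntry
{
    Oid         relfilenode;    /* relation whose buffers must be discarded */
    BlockNumber head_blkno;     /* first block not yet known to be discarded */
} DeferredDropEntry;

typedef struct BufDesc          /* stand-in for a shared buffer header */
{
    Oid         relfilenode;
    BlockNumber blkno;
    bool        valid;
} BufDesc;

/* One scan of the whole buffer pool, run once at the end of promotion. */
static void
discard_deferred(BufDesc *buffers, int nbuffers,
                 DeferredDropEntry *drops, int ndrops)
{
    for (int i = 0; i < nbuffers; i++)
    {
        for (int j = 0; j < ndrops; j++)
        {
            if (buffers[i].valid &&
                buffers[i].relfilenode == drops[j].relfilenode)
            {
                buffers[i].valid = false;       /* drop the cached page */
                if (buffers[i].blkno >= drops[j].head_blkno)
                    drops[j].head_blkno = buffers[i].blkno + 1;
            }
        }
    }
}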

As mentioned by Tsunakawa-san above, the purpose of the design is to make the
failover process as short as possible, not really to make the dropping of
relations faster. My initial thought was not to touch the regular buffer
invalidation process, but I am also not sure how much complexity the proposed
design would add, or how critical the requirement (shorter failover) really is,
so I asked for the community's feedback.

On Tue, July 31, 2018 8:27 AM, Thomas Munro wrote:
> Anyway, it's a lot of complexity, and it falls back to a worst cases
> like today, and can also transfer work to innocent foreground processes.
> I see why Andres says we should just get a better data structure so we
> can make the guy doing the dropping pay for it up front, but more
> efficiently. I suspect we may also want an ordered data structure for
> write clustering purposes one day.

I also understand the value of solving the root cause of the problem. That is
why I asked Andres whether we could expect development from the community for
v12 on the radix tree approach for buffer management, or even just a WIP patch
from anyone, whether it uses a radix tree or not. Perhaps we would like to get
involved as well, since this is also our customer's problem.

So I am just curious about the design of the radix tree approach. Maybe we
should start discussing what kind of data structures, processing, etc. would
be involved?
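As a starting point for that discussion, one property seems worth noting: if
the buffer-mapping structure is ordered by a key that packs the relfilenode
into the high bits and the block number into the low bits, then all buffers of
one relation fall into a single contiguous key range, and dropping the relation
becomes a bounded range delete instead of a scan of every buffer header. A tiny
sketch of that key layout (my own illustration, ignoring tablespace, database
and fork for brevity; not taken from any patch):

#include <stdint.h>

/* Pack (relfilenode, block) into one ordered 64-bit key. */
static inline uint64_t
buffer_key(uint32_t relfilenode, uint32_t blkno)
{
    return ((uint64_t) relfilenode << 32) | blkno;
}

/*
 * All pages of one relation lie in [*lo, *hi] -- a single range delete in a
 * radix tree (or any other ordered structure) keyed this way.
 */
static inline void
relation_key_range(uint32_t relfilenode, uint64_t *lo, uint64_t *hi)
{
    *lo = buffer_key(relfilenode, 0);
    *hi = buffer_key(relfilenode, UINT32_MAX);
}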

I have also read other proposed design solutions in another thread [1].
a. Fujii-san's proposal: a reloption to prevent VACUUM from truncating empty
pages, for performance improvement
b. Pavan's comment (a rough sketch of this control flow is included after the
link below):
> What if we remember the buffers as seen by count_nondeletable_pages() and
> then just discard those specific buffers instead of scanning the entire
> shared_buffers again? Surely we revisit all to-be-truncated blocks before
> actual truncation. So we already know which buffers to discard. And we're
> holding exclusive lock at that point, so nothing can change underneath. Of
> course, we can't really remember a large number of buffers, so we can do
> this in small chunks. Scan last K blocks, remember those K buffers, discard
> those K buffers, truncate the relation and then try for next K blocks. If
> another backend requests lock on the table, we give up or retry after a
> while.

[1] https://www.postgresql.org/message-id/flat/20180419203802.hqrb4o2wexjnb2ft%40alvherre.pgsql#7d4a8c56a01392a3909b2150371e6495
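For what it's worth, here is my rough reading of Pavan's chunked approach as
control flow, with invented helper names (hypothetical, only to illustrate the
loop, not working code against the real buffer manager):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

#define CHUNK_BLOCKS 1024       /* "K" in the description above */

/* Hypothetical helpers standing in for the real buffer/storage calls. */
extern void invalidate_buffer_for_block(BlockNumber blkno);
extern void truncate_relation_to(BlockNumber new_nblocks);
extern bool lock_waiter_present(void);

/* Truncate from old_nblocks down to new_nblocks in chunks of K blocks. */
static void
truncate_in_chunks(BlockNumber old_nblocks, BlockNumber new_nblocks)
{
    BlockNumber cur = old_nblocks;

    while (cur > new_nblocks)
    {
        BlockNumber chunk_start = (cur - new_nblocks > CHUNK_BLOCKS)
            ? cur - CHUNK_BLOCKS
            : new_nblocks;

        /*
         * Discard only the buffers of the last K to-be-truncated blocks.
         * They were already verified empty while holding the exclusive
         * lock, so nothing can change underneath.
         */
        for (BlockNumber blkno = chunk_start; blkno < cur; blkno++)
            invalidate_buffer_for_block(blkno);

        /* Physically truncate this chunk, then move on to the next one. */
        truncate_relation_to(chunk_start);
        cur = chunk_start;

        /* Give up (or retry later) if another backend wants the lock. */
        if (lock_waiter_present())
            break;
    }
}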

Now, how do we move forward?

Thank you everyone.

Regards,
Kirk Jamison
