Re: [HACKERS] GUC for cleanup indexes threshold.

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>, Peter Geoghegan <pg(at)bowt(dot)ie>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "Ideriha, Takeshi" <ideriha(dot)takeshi(at)jp(dot)fujitsu(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kuntal Ghosh <kuntalghosh(dot)2007(at)gmail(dot)com>, pgsql-hackers-owner(at)postgresql(dot)org
Subject: Re: [HACKERS] GUC for cleanup indexes threshold.
Date: 2018-03-19 11:50:48
Message-ID: CAD21AoCfWXcX-po8Q1r779nyVGzs01pwpSLM=u7Sx3Hv+L+4gg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 19, 2018 at 2:45 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Mon, 19 Mar 2018 11:12:58 +0900, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote in <CAD21AoAB8tQg9xwojupUJjKD=fMhtx6thDEPENDdhftVLWcR8A(at)mail(dot)gmail(dot)com>
>> On Wed, Mar 14, 2018 at 9:25 PM, Alexander Korotkov
>> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> > On Wed, Mar 14, 2018 at 7:40 AM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
>> > wrote:
>> >>
>> >> On Sat, Mar 10, 2018 at 3:40 AM, Alexander Korotkov
>> >> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> >> > On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
>> >> > wrote:
>> >> >>
>> >> >> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>> >> >> <a(dot)korotkov(at)postgrespro(dot)ru> wrote:
>> >> >> > 2) These parameters are reset during btbulkdelete() and set during
>> >> >> > btvacuumcleanup().
>> >> >>
>> >> >> Can't we set these parameters even during btbulkdelete()? By keeping
>> >> >> them up to date, we will be able to avoid unnecessary cleanup vacuums
>> >> >> even after an index bulk-delete.
>> >> >
>> >> >
>> >> > We certainly can update cleanup-related parameters during
>> >> > btbulkdelete().
>> >> > However, in this case we would update the B-tree meta-page during each
>> >> > VACUUM cycle. That may cause some overhead for non-append-only
>> >> > workloads. I don't think this overhead would be significant, because in
>> >> > non-append-only scenarios VACUUM typically writes much more
>> >> > information.
>> >> > But I would like this patch, oriented to append-only workloads, to be
>> >> > as harmless as possible for other workloads.
>> >>
>> >> What overhead are you referring to here? I guess the overhead is only
>> >> calculating the oldest btpo.xact. And I think it would be harmless.
>> >
>> >
>> > I meant the overhead of setting last_cleanup_num_heap_tuples after every
>> > btbulkdelete, with WAL-logging of the meta-page. I bet it would also be
>> > harmless, but I think that needs some testing.
>>
>> Agreed.
>>
>> After more thought, it might be too late, but we can consider the
>> possibility of another idea proposed by Peter. The attached patch
>> addresses the original issue of index cleanups by storing the epoch
>> number of the page-deletion XID into PageHeader->pd_prune_xid, which is
>> a 4-byte field.
>
> Mmm. It seems to me that the story is returning to the
> beginning. Could I try retelling the story?
>
> I understand that the initial problem was that vacuum runs apparently
> unnecessary full scans on indexes many times. The reason for that
> is the fact that a cleanup scan may leave some (or, under certain
> conditions, many) dead pages un-recycled, but we don't know
> whether a cleanup is needed or not. They will stay left
> forever unless we run additional cleanup scans at the appropriate
> timing.
>
> (If I understand it correctly,) Sawada-san's latest proposal is
> (fundamentally the same as the first one,) just skipping the
> cleanup scan if the vacuum scan just before it found that the number
> of *live* tuples has increased. If there were many deletions and
> insertions but no increase in the total number of tuples, we don't
> get a cleanup. Consequently it had a wraparound problem, and that
> is addressed in this version.

No, it doesn't have a wraparound problem. The patch I proposed based on
Peter's idea stores the epoch number of the page-deletion xid and
compares both epoch and xid when we judge whether the page is recyclable
or not; in effect we create a 64-bit xid out of the deletion xid. Also,
if there is even one deletion, the bulk-delete will be performed instead
of the index cleanup. So with this patch we do the index cleanup only
when the reltuples of the table has increased by the fraction
vacuum_index_cleanup_scale_factor from the previous stats. It doesn't
need to trigger the index cleanup by any xid threshold. A rough sketch
of both checks is below.
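
To illustrate what I mean, here is a rough sketch (not the exact patch
code; the function names and the 64-bit composition are illustrative):

/*
 * Rough sketch only.  The patch stores the epoch of the deletion xid
 * in PageHeader->pd_prune_xid; combining it with btpo.xact gives a
 * 64-bit deletion xid that can be compared without wraparound.
 */
static bool
_bt_page_recyclable_with_epoch(Page page, uint64 oldest_safe_xid64)
{
    PageHeader   pagehdr = (PageHeader) page;
    BTPageOpaque opaque;
    uint64       deletion_xid64;

    if (PageIsNew(page))
        return true;

    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (!P_ISDELETED(opaque))
        return false;

    /* epoch in the high 32 bits, deletion xid in the low 32 bits */
    deletion_xid64 = ((uint64) pagehdr->pd_prune_xid << 32) |
                     (uint64) opaque->btpo.xact;

    return deletion_xid64 < oldest_safe_xid64;
}

/*
 * Likewise a sketch: skip the index cleanup unless the table has
 * grown by vacuum_index_cleanup_scale_factor since the last stats.
 */
static bool
_bt_cleanup_needed(double prev_num_heap_tuples, double num_heap_tuples)
{
    return num_heap_tuples >
           prev_num_heap_tuples * (1.0 + vacuum_index_cleanup_scale_factor);
}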

> (ditto.) Alexander proposed to record the oldest xid of
> recyclable pages in the metapage (and the number of tuples at the
> last cleanup). This prevents needless cleanup scans and reliably
> runs cleanups to remove all recyclable pages.

Yes, but the concerns we discussed are that we need extra WAL-logging
for updating the metapage and that it works only for the append-only
case. If we also want to support more cases we will need to update the
metapage during bulk-delete. The overhead of WAL-logging would be
harmless but should be tested, as Alexander mentioned. A sketch of what
that metapage update might look like is below.
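
For reference, here is a sketch based on my reading of Alexander's patch
(the field names and WAL record are from that patch, not committed code,
and the details are simplified):

/*
 * Sketch based on Alexander's patch.  Calling this from btbulkdelete()
 * as well as btvacuumcleanup() would add one metapage write plus one
 * WAL record per vacuum cycle -- the overhead in question.
 */
static void
_bt_update_meta_cleanup_info(Relation rel, TransactionId oldestBtpoXact,
                             float8 numHeapTuples)
{
    Buffer          metabuf;
    Page            metapg;
    BTMetaPageData *metad;

    metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
    metapg = BufferGetPage(metabuf);
    metad = BTPageGetMeta(metapg);

    START_CRIT_SECTION();

    metad->btm_oldest_btpo_xact = oldestBtpoXact;
    metad->btm_last_cleanup_num_heap_tuples = numHeapTuples;
    MarkBufferDirty(metabuf);

    /* the extra WAL record we are discussing */
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);
        recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_META_CLEANUP);
        PageSetLSN(metapg, recptr);
    }

    END_CRIT_SECTION();

    _bt_relbuf(rel, metabuf);
}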

>
> I think that we can accept Sawada-san's proposal if we accept the
> fact that indexes can retain recyclable pages for a long
> time. (Honestly, I don't think so.)
>
> If (as I might have mentioned upthread for Yura's patch) we accept
> holding the information on the index meta page, Alexander's way would
> be preferable. The difference between Yura's and Alexander's is that
> the former runs a cleanup scan whenever a recyclable page is present,
> while the latter avoids that until recyclable pages are known to be
> removable.
>
>> Compared to the currently proposed patch, this patch needs
>> neither the page upgrade code nor extra WAL-logging. If
>
> # By the way, my proposal was storing the information, as Yura
> # proposed, in the stats collector. The information may be
> # available a bit late, but that doesn't harm. This needs neither
> # extra WAL-logging nor the upgrade code :p
>
>> we also want to address cases other than the append-only case we will
>
> I'm afraid that "the problem for the other cases" is a new one
> that this patch introduces, not an existing one.

I meant that Alexander's current proposal works for append-only
tables. If we want to support other cases we have to update the metapage
during bulk-delete, which assumes that bulk-delete always scans the
whole index.

>
>> require the bulk-delete method to scan the whole index and to log
>> WAL. But that leads to some extra overhead. With this patch we no
>> longer need to depend on a full scan of the b-tree index. This might
>> be useful in the future, if we make the bulk-delete of a b-tree index
>> not scan the whole index.
>
> Perhaps I'm understanding something incorrectly, but is it just the
> result of skipping "maybe needed" scans without considering the
> actual necessity?

I meant to scan only the index pages that are relevant to the garbage
TIDs in a table. The current b-tree index bulk-deletion is very slow and
heavy because we always scan the whole index even if there is only one
dead tuple in the table. To address this problem I'm thinking about a
way to make bulk-delete not scan the whole index when there are only a
few dead tuples in the table. That is, we do index scans to collect the
stack of b-tree pages and reclaim garbage. Maybe we would still do a
full index scan if there are a lot of dead tuples, which would be
similar to what we do when planning access paths. A purely illustrative
cost comparison is sketched below.
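
Something like the following heuristic (purely illustrative; no such
code exists) could decide when the targeted approach pays off:

/*
 * Illustrative only: decide between a full index scan and per-TID
 * descents, analogous to access-path costing.  A full scan reads every
 * leaf page once; a targeted delete pays roughly one root-to-leaf
 * descent per dead TID.
 */
static bool
use_targeted_bulkdelete(double num_dead_tids, double num_leaf_pages,
                        double tree_height)
{
    double full_scan_cost = num_leaf_pages;
    double targeted_cost = num_dead_tids * (tree_height + 1);

    return targeted_cost < full_scan_cost;
}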

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
