Re: Vacuum: allow usage of more than 1GB of work mem

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <Jim(dot)Nasby(at)bluetreble(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2016-09-08 18:10:31
Message-ID: CAD21AoAejVzY-0WivN5HfxKiyXcoaWushvAvgpS86N43u2CitA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 8, 2016 at 11:54 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>
> On Wed, Sep 7, 2016 at 10:18 PM, Claudio Freire <klaussfreire(at)gmail(dot)com>
> wrote:
>>
>> On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> > On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs <simon(at)2ndquadrant(dot)com>
>> > wrote:
>> >> On 6 September 2016 at 19:59, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >>
>> >>> The idea of looking to the stats to *guess* about how many tuples are
>> >>> removable doesn't seem bad at all. But imagining that that's going to
>> >>> be
>> >>> exact is folly of the first magnitude.
>> >>
>> >> Yes. Bear in mind I had already referred to allowing +10% to be safe,
>> >> so I think we agree that a reasonably accurate, yet imprecise
>> >> calculation is possible in most cases.
>> >
>> > That would all be well and good if it weren't trivial to do what
>> > Robert suggested. This is just a large unsorted list that we need to
>> > iterate through. Just allocate chunks of a few megabytes and when
>> > it's full allocate a new chunk and keep going. There's no need to get
>> > tricky with estimates and resizing and whatever.
>>
>> I agree. While the idea of estimating the right size sounds promising
>> a priori, the estimate can go wrong and over- or under-allocate quite
>> severely, so the risks outweigh the benefits when you consider the
>> alternative of a dynamic allocation strategy.
>>
>> Unless the dynamic strategy has a bigger CPU impact than expected, I
>> believe it's a superior approach.
>>
>
> How about a completely different representation for the TID array? This is
> probably not a new idea, but I couldn't find whether the exact same thing
> has been discussed before. I also think it's somewhat orthogonal to what we
> are trying to do here, and will probably be a bigger change, but I thought
> I'd mention it while we are on the topic.
>
> What I propose is to use a simple bitmap to represent the dead tuples. If a
> tuple at <block, offset> is dead then the corresponding bit in the bitmap is
> set. Searches through the dead tuples then become O(1) operations, which is
> important for very large tables and large arrays.
>
> The real challenge is that a heap page can theoretically have up to
> MaxOffsetNumber line pointers (or, to be precise, the maximum possible
> offset number). For an 8K block, that comes to about 2048. Having so many
> bits per page is neither practical nor optimal. But in practice the largest
> offset on a heap page should not be significantly greater than
> MaxHeapTuplesPerPage, which is a more reasonable value of 291 on my machine.
> Again, that's with zero-sized tuples; for real-life large tables with much
> wider tuples, the number may go down even further.
>
> So we cap the offsets represented in the bitmap at some realistic value,
> computed by looking at page density and then multiplying it by a small
> factor (not more than two) to take into account LP_DEAD and LP_REDIRECT line
> pointers. That should cover the vast majority of the dead tuples in the
> table, and we keep an overflow area to record tuples beyond the per-page
> limit. The search routine does a direct lookup for offsets below the limit
> and searches the sorted overflow area for offsets beyond it.
>
> For example, for a table with 60-byte-wide tuples (including the 24-byte
> header), each page can hold approximately 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page, i.e. 34 bytes per page for the
> bitmap. The first 272 offsets in every page are represented in the bitmap
> and anything greater goes to the overflow region. On the other hand, the
> current representation needs about 16 bytes per page assuming 2% dead
> tuples, 40 bytes per page assuming 5% dead tuples and 80 bytes assuming 10%
> dead tuples. So the bitmap will take more space for small tuples or when
> vacuum is run very aggressively, both of which seem unlikely for very large
> tables. Of course the calculation does not take into account the space
> needed by the overflow area, but I expect that to be small too.
>
> I guess we can make a choice between the two representations at the start by
> looking at various table stats. We can also be smart and switch from the
> bitmap to the traditional representation if, as we scan the table, we see
> many more tuples in the overflow region than we provisioned for. There will
> be some challenges in converting the representation mid-way, especially in
> terms of memory allocation, but I think those can be sorted out if we think
> the idea has merit.
>
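
For concreteness, the capped per-page bitmap with a sorted overflow area
described above amounts to roughly the following. This is only an
illustrative sketch of my reading of the proposal; the struct and function
names are hypothetical and not taken from any actual patch.

    /*
     * Hypothetical sketch of a dead-tuple store that keeps one bit per
     * (block, offset) pair up to a per-page cap, plus a sorted overflow
     * array of TIDs whose offsets exceed the cap.
     */
    #include "postgres.h"
    #include "storage/itemptr.h"

    typedef struct DeadTupleMap
    {
        int          bits_per_page;  /* per-page offset cap */
        BlockNumber  nblocks;        /* heap blocks covered by the bitmap */
        uint8       *bitmap;         /* nblocks * bits_per_page bits */
        ItemPointer  overflow;       /* sorted TIDs with offset > cap */
        int          noverflow;      /* number of overflow TIDs */
    } DeadTupleMap;

    /* O(1) bitmap test below the cap, binary search above it. */
    static bool
    dead_tuple_exists(DeadTupleMap *map, ItemPointer tid)
    {
        BlockNumber  blk = ItemPointerGetBlockNumber(tid);
        OffsetNumber off = ItemPointerGetOffsetNumber(tid);

        if (off <= map->bits_per_page)
        {
            /* offsets are 1-based, hence (off - 1) */
            uint64 bit = (uint64) blk * map->bits_per_page + (off - 1);

            return (map->bitmap[bit / 8] & (1 << (bit % 8))) != 0;
        }
        else
        {
            /* offsets beyond the cap should be rare */
            int low = 0;
            int high = map->noverflow - 1;

            while (low <= high)
            {
                int mid = (low + high) / 2;
                int cmp = ItemPointerCompare(&map->overflow[mid], tid);

                if (cmp == 0)
                    return true;
                if (cmp < 0)
                    low = mid + 1;
                else
                    high = mid - 1;
            }
            return false;
        }
    }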

Making it possible for vacuum to choose between the two data representations
sounds good.
I previously implemented a patch that changes the dead tuple representation
to a bitmap.
I will measure the performance of the bitmap representation again and post
the results.
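
As a starting point for that choice, one could compare the expected memory
use of the two representations from the stats available when vacuum starts,
along the lines of the arithmetic in the example above. Again, this is just a
hypothetical sketch, not code from the earlier patch:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "storage/itemptr.h"

    /*
     * Hypothetical heuristic: prefer the bitmap when its fixed per-page
     * cost is expected to be smaller than a TID array sized from the
     * estimated dead-tuple fraction.
     */
    static bool
    use_bitmap_representation(BlockNumber relpages, double reltuples,
                              double dead_fraction)
    {
        double tuples_per_page = (relpages > 0) ? reltuples / relpages : 0.0;
        /* cap at roughly 2x the observed density, per the proposal above */
        int    bits_per_page = Min((int) (tuples_per_page * 2),
                                   MaxHeapTuplesPerPage);
        double bitmap_bytes = (double) relpages * ((bits_per_page + 7) / 8);
        double array_bytes = reltuples * dead_fraction *
                             sizeof(ItemPointerData);

        return bitmap_bytes < array_bytes;
    }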

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
