Re: Vacuum: allow usage of more than 1GB of work mem

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2018-04-06 20:25:00
Message-ID: CAGTBQpbNUv8b1ho3w=3p8F-DsPKwrsnRXJ45dJQY9o6pziqq0g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 06/04/18 01:59, Claudio Freire wrote:
>>
>> The iteration interface, however, seems quite specific for the use
>> case of vacuumlazy, so it's not really a good abstraction.
>
>
> Can you elaborate? It does return the items one block at a time. Is that
> what you mean by being specific for vacuumlazy? I guess that's a bit
> special, but if you imagine some other users for this abstraction, it's
> probably not that unusual. For example, if we started using it in bitmap
> heap scans, a bitmap heap scan would also want to get the TIDs one block
> number at a time.

But you're also tying the caller to the format of the buffer holding
those TIDs, for instance. Why do that, when you can have an interface
that just iterates over TIDs and lets the caller store them however
(and if) they want?

I do believe a pure iterator interface is a better interface.
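
To make the comparison concrete, a pure iterator could look roughly
like the sketch below. This is hypothetical, not the API of either
patch; DeadTidStore and the function names are made up for
illustration:

#include "postgres.h"
#include "storage/itemptr.h"    /* ItemPointerData */

/*
 * Hypothetical opaque container; stands in for whichever structure
 * (segmented array, B-tree, bitmap, ...) actually holds the dead TIDs.
 */
typedef struct DeadTidStore DeadTidStore;

typedef struct DeadTidIterator
{
    const DeadTidStore *store;
    uint64      pos;            /* opaque position state */
} DeadTidIterator;

/* Position the iterator before the first stored TID. */
extern void dead_tid_begin_iterate(const DeadTidStore *store,
                                   DeadTidIterator *iter);

/*
 * Store the next TID in sorted order into *tid, or return false when
 * the store is exhausted. The caller decides how (and whether) to
 * accumulate the results: per block, into a bitmap, or not at all.
 */
extern bool dead_tid_next(DeadTidIterator *iter, ItemPointerData *tid);

A block-at-a-time consumer like lazy_vacuum_heap() can still group
TIDs by block on its side of such an interface, without the store
dictating the buffer format.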

>> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
>> trying to go for a lighter weight interface with less overhead that
>> is more general at the same time.
>
>
> Note that there was similar copying, to construct an array of OffsetNumbers,
> happening in lazy_vacuum_page() before this patch. So the net amount of
> copying is the same.
>
> I'm envisioning that this data structure will sooner or later be optimized
> further, so that when you have a lot of TIDs pointing to the same block, we
> would pack them more tightly, storing the block number just once, with an
> array of offset numbers. This interface that returns an array of offset
> numbers matches that future well, as the iterator could just return a
> pointer to the array of offset numbers, with no copying. (If we end up doing
> something even more dense, like a bitmap, then it doesn't help, but that's
> ok too.)

But that's the thing. It's an interface specialized for a future we're
not certain of. It's premature.

A generic interface doesn't preclude implementing those optimizations
later; it just allows you *not* to if there's no gain. Committing to
the specialized interface now forces your hand.

>> About the B-tree, however, I don't think a B-tree is a good idea.
>>
>> Trees' main benefit is that they can be inserted to efficiently. When
>> all your data is loaded sequentially, in-order, in-memory and
>> immutable; the tree is pointless, more costly to build, and harder to
>> maintain - in terms of code complexity.
>>
>> In this use case, the only benefit of B-trees would be that they're
>> optimized for disk access.
>
>
> Those are not the reasons for which I'd prefer a B-tree. A B-tree has good
> cache locality, and when you don't need to worry about random insertions,
> page splits, deletions etc., it's also very simple to implement. This patch
> is not much longer than the segmented multi-array.

But it *is* complex and less tested. Testing it and making it mature
will take time. Why do that if bitmaps are a better path?

>> On the other side, using B-trees incurs memory overhead due to the
>> need for internal nodes, can fragment memory because internal nodes
>> aren't the same size as leaf nodes, is easier to get wrong and
>> introduce bugs... I don't see a gain.
>
>
> The memory overhead incurred by the internal nodes is quite minimal, and can
> be adjusted by changing the node sizes. After some experimentation, I
> settled on 2048 items per leaf node, and 64 items per internal node. With
> those values, the overhead caused by the internal nodes is minimal, below
> 0.5%. That seems fine, but we could increase the node sizes to bring it
> further down, if we'd prefer that tradeoff.
>
> I don't understand what memory fragmentation problems you're worried about.
> The tree grows one node at a time, as new TIDs are added, until it's all
> released at the end. I don't see how the size of internal vs leaf nodes
> matters.

Large vacuums do several passes, so they'll create one tid map for
each pass. Each pass will allocate m_w_m-worth of pages, and then
deallocate them.
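
Roughly, the pattern looks like this (a simplified sketch, not the
actual vacuumlazy.c code; the dead_tid_* and vacuum_* helpers are
placeholders for whichever data structure ends up holding the TIDs):

#include "postgres.h"
#include "miscadmin.h"          /* maintenance_work_mem */
#include "storage/bufmgr.h"     /* RelationGetNumberOfBlocks() */
#include "utils/rel.h"          /* Relation */

/* Placeholders, not real functions; DeadTidStore is the same
 * hypothetical container as in the sketch above. */
typedef struct DeadTidStore DeadTidStore;
extern DeadTidStore *dead_tid_store_create(int work_mem_kb);
extern bool dead_tid_store_is_full(const DeadTidStore *store);
extern void dead_tid_store_reset(DeadTidStore *store);
extern void dead_tid_store_destroy(DeadTidStore *store);
extern void collect_dead_tids(Relation rel, BlockNumber blkno,
                              DeadTidStore *store);
extern void vacuum_indexes(Relation rel, DeadTidStore *store);
extern void vacuum_heap_pages(Relation rel, DeadTidStore *store);

static void
lazy_scan_heap_sketch(Relation onerel)
{
    DeadTidStore *dead_tids = dead_tid_store_create(maintenance_work_mem);
    BlockNumber nblocks = RelationGetNumberOfBlocks(onerel);
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* TID map full: do one index + heap vacuuming pass */
        if (dead_tid_store_is_full(dead_tids))
        {
            vacuum_indexes(onerel, dead_tids);
            vacuum_heap_pages(onerel, dead_tids);
            /* drops m_w_m worth of segments/nodes, rebuilt next pass */
            dead_tid_store_reset(dead_tids);
        }
        collect_dead_tids(onerel, blkno, dead_tids);
    }

    /* final pass for whatever is left */
    vacuum_indexes(onerel, dead_tids);
    vacuum_heap_pages(onerel, dead_tids);
    dead_tid_store_destroy(dead_tids);
}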

B-tree page allocations are smaller than malloc's mmap threshold, so
freeing them won't return memory to the operating system. Furthermore,
if other allocations get interleaved, objects could be left lying at
random points in the heap, preventing efficient reuse of that memory
for the next round. In essence, heap fragmentation. This may be
exacerbated by AllocSet's own fragmentation and double accounting.
From what I can tell, inner nodes will use pools, and leaf nodes will
be allocated as dedicated chunks (malloc'd).

Segments in my implementation are big exactly because of that. The aim
is to have buffers large enough that malloc will mmap them, so they
get returned to the OS (unmapped) promptly when freed, with little
overhead.
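
To be concrete about the allocator behaviour I'm relying on (glibc
specifics; this is an assumption about the default malloc, and other
platforms or allocators will differ):

#include <malloc.h>
#include <stdlib.h>

int
main(void)
{
    /*
     * glibc's mmap threshold starts at 128 kB (and adapts upward).
     * Chunks below it, e.g. a ~12 kB leaf node (2048 six-byte TIDs),
     * are carved out of the heap/arenas and stay with the process
     * after free(), available for reuse but not returned to the
     * kernel.
     */
    void *node = malloc(12 * 1024);
    free(node);                 /* kept by the allocator */

    /*
     * A multi-megabyte segment is above the threshold, so it is
     * served by mmap() and munmap()ed on free(), returning the
     * memory to the OS immediately.
     */
    void *segment = malloc(64 * 1024 * 1024);
    free(segment);              /* unmapped, returned to the OS */

    /* The threshold can also be pinned explicitly. */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

    return 0;
}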

This fragmentation may cause actual pain in autovacuum, since
autovacuum workers are relatively long-lived.

>> If you propose its use, at least benchmark it to show some gain.
>
>
> Sure. I used the attached script to test this. It's inspired by the test
> script you posted. It creates a pgbench database with scale factor 100,
> deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more
> heavily, the test script creates a number of extra indexes. Half of them are
> on the primary key, just to get more repetitions without having to
> re-initialize in between, and the rest are like this:
>
> create index random_1 on pgbench_accounts((hashint4(aid)))
>
> to stress lazy_tid_reaped() with a random access pattern, rather than
> the sequential one that you get with the primary key index.
>
> I ran the test script on my laptop, with unpatched master, with your latest
> multi-array patch, and with the attached version of the b-tree patch. The
> results are quite noisy, unfortunately, so I wouldn't draw very strong
> conclusions from it, but it seems that the performance of all three versions
> is roughly the same. I looked in particular at the CPU time spent in the
> index vacuums, as reported by VACUUM VERBOSE.

Scale factor 100 is hardly enough to stress a large-m_w_m vacuum. I've
found scales of 1k-4k (or bigger) work best, with large m_w_m settings
(4GB, for example), to really see how the data structure performs.

>> Furthermore, among the 200-ish messages this thread has accumulated,
>> better ideas have been proposed, better because they do use less
>> memory and are faster (like using bitmaps when possible), but if we
>> can't push a simple refactoring first, there's no chance a bigger
>> rewrite will fare better. Remember, in this use case, using less
>> memory far outweighs any other consideration. Less memory directly
>> translates to less iterations over the indexes, because more can be
>> crammed into m_w_m, which is a huge time saving. Far more than any
>> micro-optimization.
>>
>> About 2 years ago, I chose to try to push this simple algorithm first,
>> then try to improve on it with better data structures. Nobody
>> complained at the time (IIRC), and I don't think it's fair to
>> go and revisit that now. It just delays getting a solution for this
>> issue in pursuit of "the perfect implementation" that might never
>> arrive. And even if it does, there's nothing stopping us from pushing
>> another patch in the future with that better implementation if we
>> wish. Let's get something simple and proven first.
>
>
> True all that. My point is that the multi-segmented array isn't all that
> simple and proven, compared to an also straightforward B-tree. It's pretty
> similar to a B-tree, actually, except that it has exactly two levels, and
> the node (= segment) sizes grow exponentially. I'd rather go with a true
> B-tree, than something homegrown that resembles a B-tree, but not quite.

I disagree.

Because it is similar to what vacuum already does, we can be confident
the approach is sound, or at least as sound as current vacuum. It
shares a lot with the current implementation, which is known to be
good.

The multi-segmented array itself has also received a lot of testing
during the ~1.5 years it has spent in the making. I've been running
extensive benchmarks and tests each time I changed something, and I've
even run deep tests on an actual production snapshot: a complete
db-wide vacuum of a heavily bloated ~12TB production database, which
showed significant speedups not only because it does fewer index
scans, but also because it uses less CPU time to do them, with
consistency checking after the fact to look for bugs. Of course I
can't guarantee it's bug-free, but it *is* decently tested.

I'm pretty sure the B-tree implementation hasn't reached that level of
testing yet. It might in the future, but it won't happen overnight.

Your B-tree patch is also homegrown. You're not reusing well-tested
btree code; you're writing a B-tree from scratch, so it's as suspect
as any new code. I agree the multi-segment algorithm is quite similar
to a shallow B-tree, but I'm not convinced a B-tree is what we must
aspire to have. In fact, if you use large pages for the B-tree, you
don't need more than 2 levels (there's a 12GB limit ATM on the size of
the tid map), so the multi-segment approach and the B-tree approach
are essentially the same, except that the multi-segment code has had
more testing.

In short, there are other more enticing alternatives to try out first.
I'm not enthused by the idea of having to bench and test yet another
sorted set implementation before moving forward.

On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> Heikki Linnakangas wrote:
>> On 06/04/18 01:59, Claudio Freire wrote:
>> > The iteration interface, however, seems quite specific for the use
>> > case of vacuumlazy, so it's not really a good abstraction.
>>
>> Can you elaborate? It does return the items one block at a time. Is that
>> what you mean by being specific for vacuumlazy? I guess that's a bit
>> special, but if you imagine some other users for this abstraction, it's
>> probably not that unusual. For example, if we started using it in bitmap
>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>> number at a time.
>
> FWIW I liked the idea of having this abstraction possibly do other
> things -- for instance to vacuum brin indexes you'd like to mark index
> tuples as "containing tuples that were removed" and eventually
> re-summarize the range. With the current interface we cannot do that,
> because vacuum expects brin vacuuming to ask for each heap tuple "is
> this tid dead?" and of course we don't have a list of tids to ask for.
> So if we can ask instead "how many dead tuples does this block contain?"
> brin vacuuming will be much happier.

I don't think either patch gives you that.

The bulkdelete interface is part of the indexam and unlikely to change
in this patch.
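
For reference, this is roughly the shape of the callback today (as
declared in access/genam.h, paraphrased from memory):

/*
 * Callback that vacuum passes to each index AM's ambulkdelete():
 * it answers "is this heap TID dead?" one TID at a time.
 */
typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);

extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
                                                IndexBulkDeleteResult *stats,
                                                IndexBulkDeleteCallback callback,
                                                void *callback_state);

A "how many dead tuples does this block contain?" style question would
need a change to that API, not just to the data structure behind the
callback.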
