Re: Vacuum: allow usage of more than 1GB of work mem

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2018-04-06 13:39:21
Message-ID: cd8f7b62-17e1-4307-9f81-427922e5a1f6@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 06/04/18 01:59, Claudio Freire wrote:
> The iteration interface, however, seems quite specific for the use
> case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that
what you mean by being specific for vacuumlazy? I guess that's a bit
special, but if you imagine some other users for this abstraction, it's
probably not that unusual. For example, if we started using it in bitmap
heap scans, a bitmap heap scan would also want to get the TIDs one block
number at a time.

> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
> trying to go for a lighter weight interface with less overhead that
> is more general at the same time.

Note that there was similar copying, to construct an array of
OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the
net amount of copying is the same.

I'm envisioning that this data structure will sooner or later be
optimized further, so that when you have a lot of TIDs pointing to the
same block, we would pack them more tightly, storing the block number
just once, with an array of offset numbers. This interface that returns
an array of offset numbers matches that future well, as the iterator
could just return a pointer to the array of offset numbers, with no
copying. (If we end up doing something even more dense, like a bitmap,
then it doesn't help, but that's ok too.)

> About the B-tree, however, I don't think a B-tree is a good idea.
>
> Trees' main benefit is that they can be inserted to efficiently. When
> all your data is loaded sequentially, in-order, in-memory and
> immutable; the tree is pointless, more costly to build, and harder to
> maintain - in terms of code complexity.
>
> In this use case, the only benefit of B-trees would be that they're
> optimized for disk access.

Those are not the reasons for which I'd prefer a B-tree. A B-tree has
good cache locality, and when you don't need to worry about random
insertions, page splits, deletions etc., it's also very simple to
implement. This patch is not much longer than the segmented multi-array.

> On the other side, using B-trees incurs memory overhead due to the
> need for internal nodes, can fragment memory because internal nodes
> aren't the same size as leaf nodes, is easier to get wrong and
> introduce bugs... I don't see a gain.

The memory overhead incurred by the internal nodes is quite minimal, and
can be adjusted by changing the node sizes. After some experimentation,
I settled on 2048 items per leaf node, and 64 items per internal node.
With those values, the overhead caused by the internal nodes is minimal,
below 0.5%. That seems fine, but we could increase the node sizes to
bring it further down, if we'd prefer that tradeoff.

I don't understand what memory fragmentation problems you're worried
about. The tree grows one node at a time, as new TIDs are added, until
it's all released at the end. I don't see how the size of internal vs
leaf nodes matters.

> If you propose its use, at least benchmark it to show some gain.

Sure. I used the attached script to test this. It's inspired by the test
script you posted. It creates a pgbench database with scale factor 100,
deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped()
more heavily, the test script creates a number of extra indexes. Half of
them are on the primary key, just to get more repetitions without having
to re-initialize in between, and the rest are like this:

create index random_1 on pgbench_accounts((hashint4(aid)))

to stress lazy_vacuum_tid_reaped() with a random access pattern, rather
than the sequential one that you get with the primary key index.

I ran the test script on my laptop, with unpatched master, with your
latest multi-array patch, and with the attached version of the b-tree
patch. The results are quite noisy, unfortunately, so I wouldn't draw
very strong conclusions from it, but it seems that the performance of
all three versions is roughly the same. I looked in particular at the
CPU time spent in the index vacuums, as reported by VACUUM VERBOSE.

> Furthermore, among the 200-ish messages this thread has accumulated,
> better ideas have been proposed, better because they do use less
> memory and are faster (like using bitmaps when possible), but if we
> can't push a simple refactoring first, there's no chance a bigger
> rewrite will fare better. Remember, in this use case, using less
> memory far outweights any other consideration. Less memory directly
> translates to less iterations over the indexes, because more can be
> crammed into m_w_m, which is a huge time saving. Far more than any
> micro-optimization.
>
> About 2 years ago, I chose to try to push this simple algorithm first,
> then try to improve on it with better data structures. Nobody
> complained at the time (I think, IIRC), and I don't think it fair to
> go and revisit that now. It just delays getting a solution for this
> issue for the persuit of "the perfect implementaiton" that might never
> arrive. Or even if it doesn, there's nothing stopping us from pushing
> another patch in the future with that better implementation if we
> wish. Lets get something simple and proven first.

True all that. My point is that the multi-segmented array isn't all that
simple and proven, compared to an also straightforward B-tree. It's
pretty similar to a B-tree, actually, except that it has exactly two
levels, and the node (= segment) sizes grow exponentially. I'd rather go
with a true B-tree, than something homegrown that resembles a B-tree,
but not quite.

> I'm attaching again one version of them (I've been modifying it to
> suit my purposes at each review round), you'll probably want to tweak
> it to build test cases good for your purpose here.

Thanks!

Attached is a new version of my b-tree version. Compared to yesterday's
version, I fixed a bunch of bugs that turned up in testing.

Looking at the changes to the regression test in this, I don't quite
understand what it's all about. What are the "wait_barriers" for? If I
understand correctly, they're added so that the VACUUMs can remove the
tuples that are deleted in the test. But why are they needed now? Was
that an orthogonal change we should've done anyway?

Rather than add those wait_barriers, should we stop running the 'vacuum'
test in parallel with the other tests? Or maybe it's a good thing to run
it in parallel, to test some other things?

What are the new tests supposed to cover? The test comment says "large
mwm vacuum runs", and it sets maintenance_work_mem to 1 MB, which isn't
very large

- Heikki

Attachment Content-Type Size
vacuum-tidmap-2.patch text/x-patch 37.9 KB
vacuumbench.sh application/x-shellscript 2.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ashutosh Bapat 2018-04-06 13:40:13 Re: Get the name of the target Relation from Query struct? SOLVED!
Previous Message Ashutosh Bapat 2018-04-06 13:35:54 Re: Get the name of the target Relation from Query struct?