Re: Vacuum: allow usage of more than 1GB of work mem

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Vacuum: allow usage of more than 1GB of work mem
Date: 2018-04-05 22:59:59
Message-ID: CAGTBQpZWqQDpuaGtjFQ6H9F11WP12g9m5oe-YXV8hMs_0M2+wQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On Thu, Apr 5, 2018 at 5:02 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 03/04/18 17:20, Claudio Freire wrote:
>>
>> Ok, rebased patches attached
>
>
> Thanks! I took a look at this.
>
> First, now that the data structure is more complicated, I think it's time to
> abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support
> the following operations:
>
> * Add TIDs, in order (in 1st phase of vacuum)
> * Random lookup, by TID (when scanning indexes)
> * Iterate through all TIDs, in order (2nd pass over heap)
>
> Let's add a new source file to hold the code for the tid map data structure,
> with functions corresponding those operations.
>
> I took a stab at doing that, and I think it makes vacuumlazy.c nicer.

About the refactoring to split this into their own set of files and
abstract away the underlying structure, I can totally get behind that.

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction. It also
copies stuff a lot, so it's quite heavyweight. I'd suggest trying to
go for a lighter weight interface with less overhead that is more
general at the same time.

If it was C++, I'd say build an iterator class.

C would do it probably with macros, so you can have a macro to get to
the current element, another to advance to the next element, and
another to check whether you've reached the end.

I can do that if we agree on the points below:

> Secondly, I'm not a big fan of the chosen data structure. I think the only
> reason that the segmented "multi-array" was chosen is that each "segment"
> works is similar to the simple array that we used to have. After putting it
> behind the abstraction, it seems rather ad hoc. There are many standard
> textbook data structures that we could use instead, and would be easier to
> understand and reason about, I think.
>
> So, I came up with the attached patch. I used a B-tree as the data
> structure. Not sure it's the best one, I'm all ears for suggestions and
> bikeshedding on alternatives, but I'm pretty happy with that. I would expect
> it to be pretty close to the simple array with binary search in performance
> characteristics. It'd be pretty straightforward to optimize further, and
> e.g. use a bitmap of OffsetNumbers or some other more dense data structure
> in the B-tree leaf pages, but I resisted doing that as part of this patch.

About the B-tree, however, I don't think a B-tree is a good idea.

Trees' main benefit is that they can be inserted to efficiently. When
all your data is loaded sequentially, in-order, in-memory and
immutable; the tree is pointless, more costly to build, and harder to
maintain - in terms of code complexity.

In this use case, the only benefit of B-trees would be that they're
optimized for disk access. If we planned to store this on-disk,
perhaps I'd grant you that. But we don't plan to do that, and it's not
even clear doing it would be efficient enough for the intended use.

On the other side, using B-trees incurs memory overhead due to the
need for internal nodes, can fragment memory because internal nodes
aren't the same size as leaf nodes, is easier to get wrong and
introduce bugs... I don't see a gain. If you propose its use, at least
benchmark it to show some gain.

So I don't think B-tree is a good idea, the sorted array already is
good enough, and if not, it's at least close to the earlier
implementation and less likely to introduce bugs.

Furthermore, among the 200-ish messages this thread has accumulated,
better ideas have been proposed, better because they do use less
memory and are faster (like using bitmaps when possible), but if we
can't push a simple refactoring first, there's no chance a bigger
rewrite will fare better. Remember, in this use case, using less
memory far outweights any other consideration. Less memory directly
translates to less iterations over the indexes, because more can be
crammed into m_w_m, which is a huge time saving. Far more than any
micro-optimization.

About 2 years ago, I chose to try to push this simple algorithm first,
then try to improve on it with better data structures. Nobody
complained at the time (I think, IIRC), and I don't think it fair to
go and revisit that now. It just delays getting a solution for this
issue for the persuit of "the perfect implementaiton" that might never
arrive. Or even if it doesn, there's nothing stopping us from pushing
another patch in the future with that better implementation if we
wish. Lets get something simple and proven first.

> I haven't done any performance testing of this (and not much testing in
> general), but at least the abstraction seems better this way. Performance
> testing would be good, too. In particular, I'd like to know how this might
> affect the performance of lazy_tid_reaped(). That's a hot spot when
> vacuuming indexes, so we don't want to add any cycles there. Was there any
> ready-made test kits on that in this thread? I didn't see any at a quick
> glance, but it's a long thread..

If you dig old messages in the thread, I had attached the scripts I
used for benchmarking this.

I'm attaching again one version of them (I've been modifying it to
suit my purposes at each review round), you'll probably want to tweak
it to build test cases good for your purpose here.

Attachment Content-Type Size
vacuumbench.sh application/x-shellscript 2.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Munro 2018-04-05 23:17:48 Re: Checkpoint not retrying failed fsync?
Previous Message Edmund Horner 2018-04-05 22:46:27 Re: Fix for pg_stat_activity putting client hostaddr into appname field