Re: Have vacuum emit a warning when it runs out of maintenance_work_mem

From: Tom Lane <tgl@sss.pgh.pa.us>
To: Heikki Linnakangas <heikki@enterprisedb.com>
Cc: "Jim C. Nasby" <jim@nasby.net>, Robert Treat <xzilla@users.sourceforge.net>, pgsql-patches@postgresql.org, Guillaume Smet <guillaume.smet@gmail.com>, Jim Nasby <decibel@decibel.org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-13 23:42:36
Message-ID: 21441.1179099756@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> Or we could switch to a more compact representation of the dead tuples,
>>> and not need such a big maintenance_work_mem in the first place.

> One idea is to use a compressed bitmap like in the bitmap index patch,
> and a tree of block numbers or ranges to allow random access to it.

I thought a bit about that but it doesn't seem tremendously appealing,
at least not as the only representation, because it's not more compact
for small numbers of dead tuples per page. (And we don't have the "out"
of switching to a lossy representation, since VACUUM has to know the
exact TIDs it is going to remove from the indexes.)

Here's a design sketch that works if we are willing to limit VACUUM's
usable maintenance_work_mem to 4GB:

1. Store an array of page numbers plus offsets into a second working
array of uint16 (the offsets are 32 bits, whence the 4GB limitation).
This requires 8 bytes per page-with-dead-tuples, and since it will be
built in order as a byproduct of our scanning order, it can be
binary-searched on the page number.
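
To make that concrete, here's a rough C sketch of one per-page entry
(hypothetical names, and plain stdint types rather than the backend's
BlockNumber/OffsetNumber):

#include <stdint.h>

/*
 * Hypothetical sketch of one per-page entry: 8 bytes per page that has
 * any dead tuples.  "tagged_off" is a byte offset into the uint16 working
 * array; its low-order bit is borrowed as a format flag (see point 2).
 */
typedef struct DeadPageEntry
{
    uint32_t    blkno;          /* heap block number */
    uint32_t    tagged_off;     /* byte offset into uint16 data, low bit = format */
} DeadPageEntry;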

2. The offset part of the per-page entry points at a segment of the
uint16 array belonging to this page. It can have one of 2 formats.
For a small number of dead tuples on the page, we just store an
array of line numbers. For a larger number, we store a bitmap
showing the positions of dead tuples. While we scan a page, we
accumulate dead tuple line numbers in a small working array, and
then either copy those to the large array or build a bitmap from
them, depending on which will be smaller. Since the offsets into
the uint16 array will always be even, we can usurp the low-order
bit of the pointer word to distinguish which representation is
stored.
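
A sketch of how that tagging could work (purely illustrative; the point
is just that 2-byte alignment of the uint16 data keeps byte offsets even,
so the low bit is free):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DT_FORMAT_BITMAP  0x1       /* low bit set => segment is a bitmap */

/* Pack a byte offset and a format flag into one 32-bit word. */
static inline uint32_t
dt_make_tagged_off(uint32_t byte_off, bool is_bitmap)
{
    assert((byte_off & 1) == 0);    /* uint16 data is 2-byte aligned */
    return byte_off | (is_bitmap ? DT_FORMAT_BITMAP : 0);
}

static inline bool
dt_is_bitmap(uint32_t tagged_off)
{
    return (tagged_off & DT_FORMAT_BITMAP) != 0;
}

static inline uint32_t
dt_offset(uint32_t tagged_off)
{
    return tagged_off & ~(uint32_t) DT_FORMAT_BITMAP;
}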

3. You might think we need to spend an additional word storing
how many line numbers or bitmap words there are per page, but
we can save that space by comparing offsets of adjacent entries
in the per-page array, since we know they're stored adjacently.

I envision the per-page array as being built upwards from the bottom of
a single large maintenance_work_mem-sized array, and the uint16 array
data as being filled from the top down, and whenever the two pointers
are in danger of crossing, we stop and do an index vacuum cycle, just
like in the current logic. This lets us avoid having to guess in
advance how much per-page versus per-tuple space we will need. Note
this means the end of a page entry's uint16 data is determined by
looking at the prior page entry's offset instead of the next one,
but that seems no big problem.
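
Roughly, the bookkeeping might look like this (hypothetical struct and
field names; the literal 8 stands for one per-page entry as sketched
above):

#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: one big work area, per-page entries growing up
 * from the bottom, uint16 tuple data growing down from the top.  When a
 * page's worth of data won't fit without the two regions meeting, stop
 * and run an index vacuum cycle, as the current code does when its
 * dead-tuple array fills.
 */
typedef struct DeadTupleArena
{
    char   *base;           /* start of the work area */
    size_t  total;          /* total size in bytes (maintenance_work_mem) */
    size_t  entry_bytes;    /* used from the bottom by per-page entries */
    size_t  data_bytes;     /* used from the top by uint16 segments */
} DeadTupleArena;

/* Will one more 8-byte entry plus "seg_bytes" of uint16 data still fit? */
static bool
arena_has_room(const DeadTupleArena *a, size_t seg_bytes)
{
    return a->entry_bytes + 8 + seg_bytes + a->data_bytes <= a->total;
}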

So the lookup process involves a binary search on the page number only,
and then either a scan of the tuple line numbers or a single bitmap
probe. (We could make the scan be a binary search, but since that
representation will only be used with small numbers of tuples, it'd
probably not be any faster than a simple search loop.) AFAICS that
ought to be as fast or faster than the current lookup methodology;
significantly faster where there are many dead tuples per page.
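
Pulling the pieces together, the lookup might look something like this
(hypothetical code, reusing the DeadPageEntry and dt_* helpers sketched
above; arena_top is the byte offset at which the top-down uint16 data
began):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical lookup.  "entries" is the per-page array, sorted by block
 * number because it was built in scan order; "arena_base" is the start
 * of the work area.  Line numbers are 1-based, like heap offset numbers.
 */
static bool
dead_tuple_lookup(const DeadPageEntry *entries, int nentries,
                  const char *arena_base, uint32_t arena_top,
                  uint32_t blkno, uint16_t linenum)
{
    int         lo = 0;
    int         hi = nentries - 1;
    int         idx = -1;

    /* binary search on the block number */
    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (entries[mid].blkno == blkno)
        {
            idx = mid;
            break;
        }
        if (entries[mid].blkno < blkno)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    if (idx < 0)
        return false;           /* page has no dead tuples at all */

    uint32_t    start = dt_offset(entries[idx].tagged_off);
    /* data was filled top-down, so this segment ends where the previous
     * entry's segment begins (or at the top for the very first entry) */
    uint32_t    end = (idx > 0) ? dt_offset(entries[idx - 1].tagged_off)
                                : arena_top;
    const uint16_t *seg = (const uint16_t *) (arena_base + start);
    uint32_t    nwords = (end - start) / 2;

    if (dt_is_bitmap(entries[idx].tagged_off))
    {
        uint32_t    bit = linenum - 1;

        if (bit / 16 >= nwords)
            return false;
        return (seg[bit / 16] >> (bit % 16)) & 1;
    }
    else
    {
        for (uint32_t i = 0; i < nwords; i++)
        {
            if (seg[i] == linenum)
                return true;
        }
        return false;
    }
}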

The space requirements are:

No dead tuples on page      0 bytes   (same as now)
1 dead tuple on page       10 bytes   (vs 6 now)
2 dead tuples              12 bytes   (same as now)
3 dead tuples              14 bytes   (vs 18 now)

and so on, except that for typical table densities of say 100 tuples per
page, we will switch over to the bitmap representation at 6 or so dead
tuples per page, and so the requirement will never go beyond about 20
bytes per page whereas the current method could get as bad as 600 bytes
for an all-dead page.
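
For what it's worth, here's a quick back-of-the-envelope program that
reproduces those numbers, assuming the 8-byte entry, 2 bytes per stored
line number, and a ceil(tuples/16)-word bitmap as sketched above, versus
6 bytes per TID in the current array:

#include <stdio.h>

int
main(void)
{
    const int tuples_per_page = 100;    /* assumed line pointers per page */
    const int entry_bytes = 8;          /* per-page entry from the sketch */
    const int bitmap_bytes = ((tuples_per_page + 15) / 16) * 2;  /* = 14 */

    for (int ndead = 1; ndead <= 10; ndead++)
    {
        int     array_bytes = ndead * 2;    /* one uint16 line number each */
        int     seg_bytes = array_bytes < bitmap_bytes ? array_bytes
                                                       : bitmap_bytes;
        int     new_cost = entry_bytes + seg_bytes;
        int     old_cost = ndead * 6;       /* 6-byte ItemPointer today */

        printf("%2d dead: new %2d bytes, current %2d bytes\n",
               ndead, new_cost, old_cost);
    }
    return 0;
}

Its output also shows the roughly 20-versus-60-byte comparison for the
10-dead-tuples-per-page case discussed below.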

What this says is that it's not worth changing if you expect low
dead-tuple densities (which IIRC was the assumption I made when I
designed the current representation). In a table with an average of
less than 1 dead tuple per page, this way is a loser. OTOH, with very
low dead-tuple densities it may hardly matter, since you're going to
have to scan many GB of heap before you fill maintenance_work_mem
anyway.

If you assume that a more typical scenario is vacuuming after 10%
or so tuple "churn", then there would be 10 or so dead tuples per
page, which makes this a good win: about 20 vs about 60 bytes per
page, with the win going up the longer vacuum is delayed.

HOT would take away some of the win, probably, but I'm not sure
how much.

Comments? Can anyone invent a better data structure?

regards, tom lane
