Free-space-map management thoughts

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: Stephen Marshall <smarshall(at)wsicorp(dot)com>
Subject: Free-space-map management thoughts
Date: 2003-02-26 20:54:05
Message-ID: 6189.1046292845@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've been thinking about improving the algorithm that the free space map
(FSM) uses to decide what to store when it's not got enough shared
memory to keep track of everything. The present design uses a dynamically
adjusted threshold for each relation, throwing away pages whose free
space amount is less than the threshold. This unfortunately seems not
to work as well as I'd hoped when I wrote it :-(. In particular it
cannot cope effectively with situations where many pages have exactly
the same amount of free space --- it degenerates to all-or-nothing.
This problem has become acute now that btree indexes use the FSM to
keep track of free pages: by definition, all those pages have exactly
the same amount of free space.

I had some off-list discussions last fall with Steve Marshall, who
was trying to improve the thresholding algorithm to work better, but
what he ended up with seemed to me to have a lot of ad-hoc logic and
magic numbers in it. So I wasn't real satisfied with that.

This is a request for comments about the following redesign:

1. Continue to keep track of the average request size seen by
GetPageWithFreeSpace(), but make it a pure request-size average; don't
muck it up with thresholding adjustments. Add an entry point to make
the average request size for a relation available to callers. VACUUM
can use this as a starting point for its internal decisions about which
pages are even worth reporting to the FSM.

2. Eliminate "retail" addition of FSM entries. RecordFreeSpace() isn't
being used anyway, and we can restrict RecordAndGetPageWithFreeSpace()
to only update existing entries not make new ones. Only wholesale
replacement of a relation's FSM data, via MultiRecordFreeSpace(), need
be supported as a way of adding page entries to FSM.

3. With the above two changes, the numbers of pages passed to the FSM
by MultiRecordFreeSpace() calls become useful statistics in themselves.
We can keep track of those numbers in the FSM's per-relation statistics
independently of the number of pages actually stored in FSM.

4. In particular, the sum of the MultiRecordFreeSpace (henceforth MRFS)
page counts gives us the total number of pages we would *like* to keep
track of; the ratio of this number to the actual allocated max_fsm_pages
is our "oversubscription ratio". (One thing we can do in passing is
make VACUUM VERBOSE print these numbers, so that people finally have
some intelligent way of adjusting their FSM parameters.)

5. When FSM space is oversubscribed, we can divide each relation's MRFS
requested page count by the oversubscription ratio to arrive at an exact
target page count for each relation. This page count is stable as long
as the space-allocation picture isn't changing much, which is a big
improvement over the existing inherently history-dependent thresholding
algorithm.

6. A reasonably effective way to reduce a relation's stored page list
to the target (or select out the pages to actually remember from an MRFS
request) is as follows:
* Prescan the page data to compute a histogram of available-space
values, with maybe 32 bins.
* Find the histogram bin whose inclusion would make us exceed the target
page count. Set thresholdL = its lower edge value, thresholdU = its
upper edge value, and bincount = target page count minus sum of counts
in higher histogram bins.
* Scan pages. Keep all those >= thresholdU; drop those < thresholdL;
of the pages between the thresholds, keep the first bincount entries.
This approach will drop some pages with more free space while keeping
some with less --- but the difference between those dropped and those
kept is no more than the bin width.

This algorithm still requires some heuristics around the edges --- in
particular, in step 5 we probably don't want a purely linear division of
available space, but should allow some minimum number of pages for each
relation before divvying up the leftover space. But it doesn't seem to
need nearly as much ad-hoc tuning as the thresholding approach does.

One thing this does not do that the original code tries to do is give
preference to more-heavily-used relations --- in this approach, each
table's share of FSM is determined only by its number of pages with
free space. However, one could argue that that is an indirect measure
of activity, since it certainly reflects past deletion/update activity.
So it may be okay not to give any explicit preference to active tables.

Comments anyone?

regards, tom lane

PS: Another idea I'm toying with is to dump out the FSM contents at
postmaster shutdown and reload them at restart, so that the FSM doesn't
have to start from ground zero on every restart cycle. But that's
independent of the management algorithm...

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2003-02-26 21:13:31 Hey! I thought this was fixed in 7.2.4
Previous Message greg 2003-02-26 20:46:15 XML ouput for psql