Rewriting Free Space Map

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Rewriting Free Space Map
Date: 2008-03-16 21:57:09
Message-ID: 47DD97B5.5020309@enterprisedb.com
Lists: pgsql-hackers

I've started working on revamping the Free Space Map, using the
approach where a map of the free space on heap pages is stored on every
nth heap page. What we need now is discussion on the details of how
exactly it should work.

Here's my rough plan for the implementation. It's long, sorry. I'm
fairly confident in it, but please let me know if you see any problems
or have any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page. When
extending the relation, the heapam checks if the block number of the
new page is one that belongs to the FSM. If it is, it lets the FSM
initialize it by calling InitFSMPage() on it, and extends the relation
again to get another, normal heap page.
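
To make that concrete, here's a rough sketch of the check. The spacing
constant and the RelationExtend() name are made up, and whether block 0
itself is an FSM page is one of the details still to be nailed down:

/*
 * Sketch only: is a given block number reserved for the FSM?  Assumes
 * a hypothetical layout with FSM pages at blocks 0, n, 2n, ...
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;		/* as in PostgreSQL */

#define PAGES_PER_FSM_PAGE 8192		/* hypothetical "n": ~one map byte per heap page at 8K BLCKSZ */

static bool
BlockIsFSMPage(BlockNumber blkno)
{
	return (blkno % PAGES_PER_FSM_PAGE) == 0;
}

/*
 * Extension path, in pseudocode (RelationExtend is just a stand-in for
 * however we actually allocate a new block):
 *
 *	blkno = RelationExtend(rel);
 *	if (BlockIsFSMPage(blkno))
 *	{
 *		InitFSMPage(rel, blkno);		-- let the FSM set it up
 *		blkno = RelationExtend(rel);	-- and grab a normal heap page
 *	}
 */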

I chose the "every nth page is an FSM page" approach, rather than using
a separate relfilenode, which I also considered. The separate
relfilenode approach has some advantages, like being able to scan all
FSM pages in a sequential fashion, but it involves a fair amount of
catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room
for heap tuples. It simplifies the locking, as the FSM doesn't need to
worry about the possibility that the caller is already holding a lock
on the same page.

In an FSM page, there's one byte for each of the next N heap pages,
starting from the FSM page. That one byte stores the amount of free
space on the corresponding heap page, in BLCKSZ/256 byte precision (32
bytes with default block size).

The mapping of free space to these 256 "buckets" wouldn't necessarily
have to be linear. We could, for example, have a single bucket for
pages with more than BLCKSZ/2 bytes of free space and divide the rest
linearly into 16-byte buckets, but let's keep it simple for now. Of
course, we could also just use 2 bytes per page and store the free
space exactly, but 32-byte precision seems like enough to me.
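
For illustration, the simple linear mapping would be something like
this (function names made up):

/*
 * Sketch of the linear mapping between free bytes and the one-byte
 * "bucket" stored in the FSM.  With BLCKSZ = 8192 each bucket is 32
 * bytes wide; free space is rounded down when stored.
 */
#include <stdint.h>

#define BLCKSZ 8192

static uint8_t
FreeSpaceToBucket(int freebytes)
{
	if (freebytes >= BLCKSZ)
		freebytes = BLCKSZ - 1;		/* can't happen, but keep it in one byte */
	return (uint8_t) (freebytes / (BLCKSZ / 256));
}

/* Minimum free bytes a page in the given bucket is guaranteed to have. */
static int
BucketToFreeSpace(uint8_t bucket)
{
	return (int) bucket * (BLCKSZ / 256);
}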

Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of
whether a page is used or not, we only need one bit per page. If the
amount of free space on index pages becomes interesting for an indexam
in the future, it can use the heap FSM implementation instead. Or no
FSM at all, like the hash indexam.

To use the index FSM, the indexam needs to leave every nth page alone,
like in the heap. The B-tree assumes that the metapage is at block 0,
but that's also where we would like to store the first index FSM page.
To overcome that, we can make the index FSM special area a little bit
smaller, so that the B-tree metadata fits on the same page as the FSM
information. That will be wasted space on FSM pages other than block 0,
but we're only talking about 24 bytes per FSM page, and we only need
one FSM page per ~512 MB of index pages (with the default BLCKSZ).
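
To put numbers on that, here's the bit-per-page bookkeeping and the
coverage arithmetic (helper names made up):

/*
 * Sketch of the index FSM bitmap.  One bit per index page means one FSM
 * page covers roughly BLCKSZ * 8 = 65536 pages, i.e. 512 MB of index at
 * the default block size (a little less once we reserve room for the
 * B-tree metadata, as described above).
 */
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192
#define INDEX_PAGES_PER_FSM_PAGE (BLCKSZ * 8)

/* Mark an index page as free (true) or in use (false). */
static void
IndexFSMSetFree(uint8_t *bitmap, uint32_t slot, bool isfree)
{
	if (isfree)
		bitmap[slot / 8] |= (uint8_t) (1 << (slot % 8));
	else
		bitmap[slot / 8] &= (uint8_t) ~(1 << (slot % 8));
}

static bool
IndexFSMIsFree(const uint8_t *bitmap, uint32_t slot)
{
	return (bitmap[slot / 8] & (1 << (slot % 8))) != 0;
}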

Performance
-----------

The patch I'm working on currently uses a naive way to find a page in
the FSM. To find a page with X bytes of free space, it scans the FSM
pages until one with enough space is found. And it always starts the
scan from the beginning, which is far from optimal. And when there is
no page with enough free space, it still needs to scan all the FSM
pages just to find out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that.
First of all, we need to somehow cache the information that "there's no
page with >= X bytes left", to avoid fruitless scanning. To speed up
the case when there are only a few pages with enough free space, we can
keep a limited-size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages
like the FSM pages, managed by the buffer manager, or in a separate
shmem block. I would like to go with normal bufmgr-managed pages, as
fixed-sized memory blocks have their problems, and the cached
information should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List
(FSL) page, in addition to the normal FSM pages. It has the following
structure:

typedef struct FSLPageData
{
	/* 256 bits, one per free-space bucket: any pages in this bucket? */
	uint8		anypages[256 / 8];

	/* small cache of pages known to have free space */
	struct
	{
		BlockNumber	blockno;
		uint8		freespace;		/* free-space bucket, 0..255 */
	}			freespacelist[1];	/* variable length: as large as fits on the page */
} FSLPageData;

Remember that we track the free space on each page using one byte; IOW,
each page falls into one of 256 buckets of free space. In the anypages
bitmap, we have one bit per bucket, indicating "are there any pages
with this much free space". When we look for a page with X bytes, we
check the bits from the bucket corresponding to X bytes upwards, and if
none of them are set, we know not to bother scanning the FSM pages.
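
In code, the check could look roughly like this (building on the
structure above; names made up):

/*
 * Sketch: can any page possibly have at least 'needed' bytes free?
 * Free space is rounded down into BLCKSZ/256-byte buckets when stored,
 * so round the request up to the first bucket that guarantees it.
 */
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ		8192
#define BUCKET_SIZE	(BLCKSZ / 256)		/* 32 bytes at the default block size */

static bool
FSLMightHaveSpace(const uint8_t anypages[256 / 8], int needed)
{
	int		bucket = (needed + BUCKET_SIZE - 1) / BUCKET_SIZE;

	for (; bucket < 256; bucket++)
	{
		if (anypages[bucket / 8] & (1 << (bucket % 8)))
			return true;		/* some page claims this much free space */
	}
	return false;				/* no point scanning the FSM pages */
}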

To speed up the scan when there is space, we keep a simple list of
pages with free space. This list is actually like the current FSM, but
here we only use it as a small cache of the FSM pages. VACUUM and any
other operations that update the FSM can put pages on the list when
there are free slots.
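
Adding a page to that cache could be as simple as this sketch (again on
top of the structure above; uint8 and BlockNumber as in PostgreSQL, and
InvalidBlockNumber marks an empty slot):

/*
 * Sketch: VACUUM (or anything else updating the FSM) caches a page in
 * the FSL if there's a free slot.  'nslots' is however many entries fit
 * on the page.
 */
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)	/* as in PostgreSQL */

static bool
FSLTryAddPage(FSLPageData *fsl, int nslots,
			  BlockNumber blkno, uint8 freespace)
{
	int		i;

	for (i = 0; i < nslots; i++)
	{
		if (fsl->freespacelist[i].blockno == InvalidBlockNumber)
		{
			fsl->freespacelist[i].blockno = blkno;
			fsl->freespacelist[i].freespace = freespace;
			return true;
		}
	}
	return false;		/* list full; the FSM page still has the information */
}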

We can store the FSL page at a magic location, say block #100. For
relations smaller than that, there's no need for the FSL and we might
as well scan the FSM page. I'm not sure if we should have more than one
FSL page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It
shouldn't be too hard, though, as the FSM/FSL information is just a
hint anyway. We do need to take care that we don't permanently lose
track of any significant amount of free space: once we can do partial
vacuums using the visibility map, VACUUM might not visit one part of a
table even if other parts of it are frequently updated.

Page allocation algorithm
-------------------------

There are many different ways we can hand out pages from the FSM.
Possible strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention
2. Prefer low-numbered pages, to increase the chance of being able to
truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, and
prefer almost-full pages for small allocations, to use all space more
efficiently.
4. If the table is clustered, try to use pages close to those with
similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many
complaints, so I wouldn't worry too much about the other goals. 4 and 6
would need some major changes to the indexam and buffer manager
interfaces, respectively, and 3 could lead to more I/O when you do a
lot of small inserts.

We can spread inserts of different backends in the Free Space List by
moving a page to the end of the list when it's handed out, or removing
it from there altogether. When the FSL is empty, we can vary the place
where we start to scan the FSM.
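
A sketch of the first variant, handing out the head of the list and
rotating it to the end (on top of the FSL structure and the
InvalidBlockNumber convention from the earlier sketches):

/*
 * Sketch: hand out the first cached page and rotate it to the end of
 * the list, so that concurrent backends don't all pile onto the same
 * page.  A real version would pick the first entry with enough free
 * space for the tuple being inserted.
 */
static BlockNumber
FSLGetPageAndRotate(FSLPageData *fsl, int nslots)
{
	BlockNumber	result = fsl->freespacelist[0].blockno;
	uint8		freespace = fsl->freespacelist[0].freespace;
	int			i;

	if (result == InvalidBlockNumber)
		return InvalidBlockNumber;	/* list empty; fall back to an FSM scan */

	/* shift the remaining entries up and put the handed-out page last */
	for (i = 0; i < nslots - 1; i++)
		fsl->freespacelist[i] = fsl->freespacelist[i + 1];
	fsl->freespacelist[nslots - 1].blockno = result;
	fsl->freespacelist[nslots - 1].freespace = freespace;

	return result;
}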

To prefer low-numbered pages, to increase the chance of being able to
truncate the relation later, we can favor lower-numbered blocks in
VACUUM when we decide which blocks to put into the FSL. And we can also
bias the starting point of the FSM scan when the FSL is empty.

Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider
how the Visibility Map fits into the scheme. My current thinking is
that there will be one bit per heap page in the visibility map. The
exact semantics of that one bit are still not totally clear, but that's
not important right now.

There are a few alternatives:
1. Mix the visibility map with the FSM, stealing one bit of every FSM
byte. There would then be 7 bits for storing how much space there is on
each page, and 1 bit indicating the visibility.

2. Allocate part of each FSM page for the visibility map. For example,
the first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part
would be a straight bitmap with one bit per page, and the FSM part
would use one byte per page. (There's a rough sketch of this layout
after the list.)

3. Use different pages for the VM, for example every nth page for VM,
and every (n+1)th page for FSM.
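
To make alternative 2 a bit more concrete, here's roughly what the
split would look like, with one bit per heap page for the VM and one
byte per heap page for the FSM (the usable-bytes figure is just
illustrative):

/*
 * Sketch of alternative 2.  Each heap page needs 9 bits of map data
 * (1 VM bit + 8 FSM bits), so N usable bytes on a map page cover about
 * (N * 8) / 9 heap pages.
 */
#include <stdint.h>

#define MAPPAGE_USABLE_BYTES	8128	/* hypothetical: BLCKSZ minus page overhead */
#define HEAP_PAGES_PER_MAP_PAGE	((MAPPAGE_USABLE_BYTES * 8) / 9)	/* ~7224 */

typedef struct CombinedMapPage
{
	uint8_t		vm_bits[HEAP_PAGES_PER_MAP_PAGE / 8 + 1];	/* the ~1/9 VM part */
	uint8_t		fsm_bytes[HEAP_PAGES_PER_MAP_PAGE];			/* the ~8/9 FSM part */
} CombinedMapPage;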

I'm leaning towards 2 at the moment. 3 is intriguing as well, though,
because it would help with potential lock contention on the VM/FSM pages.

Per-chunk relfrozenxid
----------------------
I'm imagining that we would set a bit in the VM if a page has no dead
tuples at all, making it possible to use it for index-only scans. Such
a page is also uninteresting for VACUUM. However, we would still need
to scan the whole table to freeze tuples.

To alleviate that, we could allocate some space in the FSM pages to
store the smallest Xid in each range of N heap pages. IOW, like
relfrozenxid, but with a granularity of N pages rather than the whole
relation. You would still have to scan that range of pages to update
it, but that's much better than scanning the whole relation.
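
As a sketch, that extra bookkeeping could be little more than an array
of Xids next to the map, updated whenever VACUUM has scanned a whole
chunk (the chunk size here is made up):

/*
 * Sketch: one "frozen Xid" per chunk of N heap pages, like relfrozenxid
 * at a finer granularity.  The array could live on the FSM pages.
 */
#include <stdint.h>

typedef uint32_t TransactionId;		/* as in PostgreSQL */

#define PAGES_PER_CHUNK 1024		/* hypothetical chunk size */

/* After scanning every page in a chunk, record the oldest Xid seen in it. */
static void
SetChunkFrozenXid(TransactionId *chunk_xids, uint32_t heap_blkno,
				  TransactionId oldest_xid_in_chunk)
{
	chunk_xids[heap_blkno / PAGES_PER_CHUNK] = oldest_xid_in_chunk;
}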

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
