Re: Free Space Map data structure

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 19:08:27
Message-ID: 47FD142B.1010407@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> ... what I really wanted to discuss is the data structure needed
>> for the Free Space Map.
>
>> The FSM data structure needs to support two basic operations:
>> 1. Fast lookup of page with >= X bytes of free space
>> 2. Update of arbitrary, individual pages.
>> Our current code doesn't support 2, as we always update the FSM in bulk
>> after vacuum, but we will need that capability to be able to do partial
>> vacuums in the future.
>
> Note that the lack of such infrastructure is mainly because we didn't
> need it, not because it couldn't be done.

Yep. I actually remember seeing a function in freespace.c in CVS history
to do partial updates, but it was removed at some point because it
wasn't needed.

>> One brilliant idea I had, is a binary heap/tree kind of structure, where
>> each heap page is represented by one leaf node.
>
> I'm worried about a couple of things here:
>
> * Loss of concurrency in access to the FSM itself, due to updates having
> to change large amounts of interrelated entries, and due to all
> processes always wanting to touch the root first.

When searching for free space, we start from the top, and go down from
there. We only need to take a shared lock on the (page that contains)
the upper nodes, so that shouldn't become a bottleneck.

When we update a node, we can stop propagating the change upwards as
soon as we hit a node where the other child has less space than the
current node. For example:

5
5 3
4 5 3
4 3 2 5 2 3

To update the first leaf node from 4 to 2, we need to update it's parent
to 3. But we don't need to update it's parent (5), and we can stop at
that point without accessing the root.

We probably want to mark new pages as full, as Hannu suggested, to avoid
repeatedly updating the FSM during bulk loads.

Given that the current FSM is protected by a single lock, shared by all
relations, and always taken in exclusive mode, I'm not too worried. I'm
not sure how long an FSM page needs to be locked in my scheme compared
to the current FSM, but at least there will be a separate FSM for each
relation. I do want to beat the current FSM in terms of scalability,
though, I think I've seen FreeSpaceLock become contented on some tests.

As a little optimization, we could optimistically start from somewhere
else than the top node when searching for free space. Perhaps from the
top of the FSM page that contained our previous rd_targetblock. I don't
think we need to, though.

> * Loss of concurrency in subsequent heap updates. The current FSM gets
> one thing right: if different backends ask for space, they will (if at
> all possible) be given different heap pages to insert into, and
> therefore will not suffer page-level contention while they do their
> insertions. I don't see what you'll do to ensure a comparable behavior
> with this structure.

Whenever we have a choice to traverse to either left or right child
because there's heap pages with enough free space on both sides, we can
randomize that decision to achieve the spreading.

This actually brings up another nice property this data structure has:
instead of randomizing, we can choose the child node that's "closer" to
some specific heap page. That can be used to keep an updated tuple
version close to the old version, or to maintain cluster order. That's a
goal that's at least somewhat at odds with the goal of spreading the
inserts, of course, but the data structure has the flexibility, which is
nice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2008-04-09 19:17:09 Re: [PATCHES] libpq type system 0.9a
Previous Message Claudio Rossi 2008-04-09 18:56:26 Re: Segfault using heap_form_tuple