Re: Relation forks & FSM rewrite patches

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: Relation forks & FSM rewrite patches
Date: 2008-07-04 11:10:44
Message-ID: 1215169844.4051.95.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-patches


On Fri, 2008-07-04 at 12:27 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2008-06-27 at 19:36 +0300, Heikki Linnakangas wrote:
> >
> >> Here's an updated version of the "relation forks" patch, and an
> >> incremental FSM rewrite patch on top of that. The relation forks patch
> >> is ready for review. The FSM implementation is more work-in-progress
> >> still, but I'd like to get some review on that as well, with the goal of
> >> doing more performance testing and committing it after the commit fest.
> >
> > Hmmm, a 6000 line page with no visible documentation, readme, nor any
> > discussion on -hackers that I'm aware of.
>
> There is a readme in the patch, and there certainly has been discussion
> on -hackers, see:
>
> http://archives.postgresql.org/pgsql-hackers/2008-04/msg00415.php
>
> where the current design is discussed for the first time.

OK, I see the readme now. Thanks.

Minor point but the readme needs to draw a clear distinction between FSM
pages and data pages, which confused me when I read it first time.

Few thoughts:

Contention on FSM pages concerns me. Every change has the potential to
bubble up to the top, which would cause a significant problem. I'd like
to find a way to minimise the number of bubble-up operations, otherwise
there will be no material difference between using a single whole-FSM
lock like we do now and this new scheme. Note that in this new scheme
the path length to the bottom of the tree is considerably longer and can
stall waiting on I/O - so contention in the FSM is a big issue. (It
always has been in databases as long as I can remember).

The FSM tree as proposed has exact values. What if we bubbled up based
upon only significant tree changes, rather than exact changes? After
all, we only really care about whether we can fit one tuple in, so
changes smaller than avg row length have no real benefit, just
contention. So perhaps we can perform bubble-up only when the change in
free space in greater than avg row size or the remaining space has
dropped to less than 2*avg row size, where exact values begin to matter
more. That way fewer changes bubble up and fewer write locks needed on
higher level pages. We probably need to track row size variation to make
this work effectively in some cases.

Also, as FSM map becomes empty we will see more and more bubble up
operations reaching top parts of the tree. We need a way to avoid
contention from growing over time.

I'm not happy about the FSM being WAL logged for minor changes (new
pages, yes). The high number of changes will cause extra traffic where
we don't want it. This will accentuate the locks in key areas and will
introduce additional delays into the code paths that use FSM, but don't
currently write WAL. Writing WAL will further exacerbate the expected
contention around the FSM root page.

We will write dirty FSM pages at checkpoint, so just allow that rather
than logging every change. If we crash, going back to the last
checkpoint is probably OK, since all mistakes will cause corrections in
the FSM anyway and it will soon correct itself. If the values are only
approximate this will make the post-crash imprecision of the FSM less
important anyway. If you're worried about sending the FSM to standby
servers, then we can copy it to WAL once every few checkpoints.

The tree structure seems regular. Can we jump straight to bottom of tree
when its clear that there's tons of space available in certain part of
the table?

One of the current functions of the FSM is to hand out a different
target block to each backend. This naturally reduces contention during
heavy writes. The current design uses random(), but I fear that may not
be very useful when the number of useful routes is reduced. Is there a
more deterministic and uniformly better way of separating out backends?
Using the bits of the pid to send the search different ways?

(On a humorous note, it surprises me that you have no time to work on
index enhancements, so the first thing you do is write a custom index
for the FSM :-)

> User documentation is still to be done. There isn't anything to tune, or
> anything that requires manual maintenance, so there isn't much to
> document, though, except perhaps a chapter in the "Internals" part.

Yeh, can't see anything I'd want a parameter for.

> Here's an updated version of the patch. Changes:
> - one bug has been fixed (I assumed that it's always safe to call
> rightchild(x) on a parent node, but that was not always true for the
> rightmost branch of the tree)
> - there's some new debugging code.
> - if we can't find a page with enough free space in search_avail after
> starting from scratch 1000 times, give up. That shouldn't happen, but
> see below.

Such a heuristic seems useful. I'd suggest that FSM page locks be
usually taken conditionally after the root level. So if one route is
busy, try another, unless the caller prefers to wait to allow keeping
the heap in order. Presumably the FSM can page to disk, so some waits
may be much longer than anybody cares to wait across.

> There is still a nasty bug somewhere, probably related to locking :-(.

Thats a generally true statement :-)

> I
> ran DBT-2 with this, and after about 1h a FSM lookup goes into an
> endless loop, hogging all CPU. I suspect that there's a bug somewhere so
> that a change to the root node of a lower level FSM page isn't
> propagated to the upper level FSM page for some reason. oprofile shows
> that the endless loop happens in search_avail. This is why I added the
> code to give up after 1000 tries, hoping to get some debugging output
> the next time that happens.

Maybe re-initialising the level value when jumping between pages, so it
restarts at level zero. Maybe call them level 1+ so you can spot that
more easily.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

In response to

Responses

Browse pgsql-patches by date

  From Date Subject
Next Message Zdenek Kotala 2008-07-04 11:13:03 Re: page macros cleanup
Previous Message Pavan Deolasee 2008-07-04 11:05:36 Re: page macros cleanup