Improving free space usage (was: Reducing relation locking overhead)

From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Gregory Maxwell <gmaxwell(at)gmail(dot)com>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Improving free space usage (was: Reducing relation locking overhead)
Date: 2005-12-08 18:57:02
Message-ID: 20051208185702.GB58449@nasby.net
Lists: pgsql-hackers

On Thu, Dec 08, 2005 at 11:58:50AM +0200, Hannu Krosing wrote:
> On Thu, 2005-12-08 at 01:08, Jim C. Nasby wrote:
> > On Thu, Dec 08, 2005 at 08:57:42AM +0200, Hannu Krosing wrote:
> > > On Thu, 2005-12-08 at 00:16, Jim C. Nasby wrote:
> > > > On Sat, Dec 03, 2005 at 10:15:25AM -0500, Greg Stark wrote:
> > > > > Tom Lane <tgl@sss.pgh.pa.us> writes:
> > > > > > What's worse, once you have excluded writes you have to rescan the entire
> > > > > > table to be sure you haven't missed anything. So in the scenarios where this
> > > > > > whole thing is actually interesting, ie enormous tables, you're still
> > > > > > talking about a fairly long interval with writes locked out. Maybe not as
> > > > > > long as a complete REINDEX, but long.
> > > > >
> > > > > I was thinking you would set a flag to disable use of the FSM for
> > > > > inserts/updates while the reindex was running. So you would know where to find
> > > > > the new tuples, at the end of the table after the last tuple you read.
> > > >
> > > > What about keeping a separate list of new tuples? Obviously we'd only do
> > > > this when an index was being built on a table.
> > >
> > > The problem with a separate list is that it can be huge. For example, on
> > > a table with 200 inserts/updates per second, an index build lasting 6
> > > hours would accumulate a total of 6*3600*200 = 4,320,000 new tuples.
> >
> > Sure, but it's unlikely that such a table would be very wide, so 4.3M
> > tuples would probably only amount to a few hundred MB of data. It's also
> > possible that this list could be vacuumed by whatever the regular vacuum
> > process is for the table.
>
> I think that keeping such a list as part of the table at a well-defined
> location (like pages from N to M) is the best strategy, as it will
> automatically make all new tuples available to parallel processes and
> avoids both duplicate storage and the need to change the insert/update
> code.

There's one thing I hate about that idea, though: good luck trying to
move those tuples somewhere else once the index build is done and you
want to shrink the table back down to a more normal size. If we had a
better way to do that it would be much more palatable, but right now, on
a heavily updated table, this would result in a lot of bloat.

Along those lines, I've wondered if it makes sense to add more
flexibility in how free space is reclaimed in a table. One obvious
possibility is a strategy where new tuples always look to the FSM for
space (instead of going into the current page when possible), and the
FSM always hands out the earliest page in the table that has room. This
mode would have the effect of migrating tuples towards the front of the
table, allowing space at the end to be reclaimed. A variation might be
that this mode does not affect tuples that are generated as part of an
UPDATE and are already in the first x% of the table, since it doesn't
make sense to move a tuple from page 2 to page 1 of a 1000-page table.
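
To make that concrete, here's a minimal sketch of such an "earliest page
first" policy. This is not PostgreSQL's actual FSM code; the FsmSketch
struct, the function name, and the skip_pct threshold are all
hypothetical, and a real implementation would live inside the FSM's
shared-memory structures:

#include <stdio.h>
#include <string.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define MAX_FREE_PAGES 1024

typedef struct FsmSketch
{
    BlockNumber pages[MAX_FREE_PAGES];  /* blocks with free space, sorted ascending */
    int         npages;
} FsmSketch;

/*
 * "Earliest page first" placement: always hand out the lowest-numbered
 * page that has free space, so new tuple versions migrate toward the
 * front of the table and the tail can eventually be reclaimed.  A tuple
 * already in the first skip_pct percent of the table stays put, since
 * moving a tuple from page 2 to page 1 of a 1000-page table buys nothing.
 */
static BlockNumber
fsm_get_earliest(FsmSketch *fsm, BlockNumber current_page,
                 BlockNumber rel_size, unsigned int skip_pct)
{
    BlockNumber page;

    if (current_page != InvalidBlockNumber &&
        current_page < rel_size * skip_pct / 100)
        return current_page;        /* leave the updated tuple where it is */

    if (fsm->npages == 0)
        return InvalidBlockNumber;  /* no free space: caller extends the rel */

    page = fsm->pages[0];
    memmove(&fsm->pages[0], &fsm->pages[1],
            (size_t) (fsm->npages - 1) * sizeof(BlockNumber));
    fsm->npages--;
    return page;
}

int
main(void)
{
    FsmSketch fsm = { {3, 17, 250}, 3 };

    /* UPDATE whose old version sits on page 500 of a 1000-page table:
     * the new version is pulled forward to page 3. */
    printf("placed on page %u\n", fsm_get_earliest(&fsm, 500, 1000, 10));

    /* UPDATE on page 2: already within the first 10%, so it stays put. */
    printf("placed on page %u\n", fsm_get_earliest(&fsm, 2, 1000, 10));
    return 0;
}

The appealing property, if it worked out, is that ordinary update
traffic would slowly compact the table for free, and a plain vacuum
could then truncate the empty tail.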

Another possibility is to always go to the FSM and have it hand back the
free page closest to where the new tuple belongs according to the
table's clustering index. This would allow ALTER TABLE ... CLUSTER ON to
be much more self-maintaining. The downside is that you'd have to do a
lookup on that index, but presumably if the index is important enough to
cluster on then it should be well-cached. There are probably other
tweaks that could be made as well to improve performance.
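
As a sketch only (reusing the hypothetical FsmSketch type and
InvalidBlockNumber from above; again, none of this is actual PostgreSQL
code), the "nearest page" lookup could be a binary search over the same
sorted free-page list, where target is the heap page that the clustering
index says the tuple's key-order neighbors live on:

/*
 * Cluster-aware variant: hand back the free page closest to "target",
 * the heap page on which the tuple's neighbors in the cluster index
 * already live.
 */
static BlockNumber
fsm_get_nearest(const FsmSketch *fsm, BlockNumber target)
{
    int lo = 0, hi = fsm->npages;

    if (fsm->npages == 0)
        return InvalidBlockNumber;  /* no free space anywhere */

    /* find the first free page >= target */
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (fsm->pages[mid] < target)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* pick whichever neighbor of the insertion point is closer */
    if (lo == 0)
        return fsm->pages[0];
    if (lo == fsm->npages)
        return fsm->pages[fsm->npages - 1];
    return (target - fsm->pages[lo - 1] <= fsm->pages[lo] - target)
        ? fsm->pages[lo - 1]
        : fsm->pages[lo];
}

The expensive part would be the index probe needed to compute target;
the search over the free-page list itself is cheap.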
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
