Re: Free Space Map data structure

From: PFC <lists(at)peufeu(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Hannu Krosing" <hannu(at)krosing(dot)net>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-10 08:58:42
Message-ID: op.t9d0j4jzcigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> PFC wrote:
>> About the FSM :
>> Would it be possible to add a flag marking pages where all tuples
>> are visible to all transactions ? (kinda like frozen I think)
>
> Ah, the visibility map. That's another line of discussion. The current
> plan is to not tie that to the FSM, but implement it separately. There's
> some infrastructure changes that are needed for both, like the "map
> forks" (see recent FSM discussions), which is why we need to have a
> design for FSM as well before we start implementing the visibility map.
>
> It's definitely something I want to do for 8.4.

Ahhhhhh, yes, yes, yes ;) yes !!!!!

> Here's my rough plan:
> 1. Common "map forks" support
> 2. Rewrite FSM
> 3. Implement visibility map, to allow partial vacuums
> 4. Implement index-only scans, using the visibility map.

Throwing another idea that is related to partial vacuums (perhaps ?):
Consider a table that is often inserted (archive, forum posts, log,
whatever), we want to CLUSTER it.
In that case it would be beneficial to only cluster the "tail" of the
table, where all the recently inserted rows are.

Example :
- Table is clustered.
- Insert rows in random order, update some, delete some, etc, supposing
inserts happen at the end
- Table now looks like "head":[clustered part with some holes] plus
"tail":[rows in random order]
- As long as the "tail" fits in disk cache, the random order isn't a
problem.
- So, when the "tail" reaches a certain size :
- Grab it, sort it by cluster order, write it again in the heap
- Update the indexes in a manner similar to VACUUM (ie. bulk update)
- Table now looks like "head":[clustered part with some holes] plus
"tail":[clustered]
This does not remove the holes in the "head", but is this really a
problem ? In this usage scenario, I don't think so. Regular CLUSTER could
also be run, much less frequently than before, and it will also be much
faster since the rows are approximately in-order already.

This approach is complimentary to the "auto-cluster" approach where the
index is asked "where should I insert that row ?" (this will be used to
fill the holes). Auto-cluster will work well in tables that are updated
very often. But starting from an empty table, or an already clustered
table, or in a mostly-insert scenario, the index will have no idea where
to put that row...

The goodness of this approach is that
- As long as the "tail" fits in RAM, sorting it is going to be very fast
(unlike the current CLUSTER).
- Bulk index updates will also be fast as long as the list of changes to
apply to the index fits in memory.
- Therefore it will block the table for much less time than good old
CLUSTER.
- Therefore it will get more use ;)

How to make it non-locking ?
- Doing something like this in pseudo SQL :
INSERT INTO table SELECT * FROM (DELETE FROM table WHERE date > last time
we did this RETURNING *) ORDER BY cluster_columns;
VACUUM;

That is, take the "tail" of the table (as above), sort it, insert it back
in big chunks, and mark the old rows as deleted just like a regular delete
would have done. Then VACUUM.

In this case you now have :
"head":[clustered part with some holes] + "big hole" + "tail":[clustered
rows]

Is the "big hole" a problem ? Probably not, it will be marked as free
space by VACUUM and used for new inserts. A week later we get this :
"head":[clustered part with some holes] + [rows in random order]
+ "tail":[clustered rows]
Repeating the process above will make the "tail" grow and the "hole" will
stay more or less in the same place.

Another way to do it is to use partitions :
- "archive" table
- "current" table

Periodically the rows from "current" are transferred to the "archive"
table and sorted in the process. Then "current" is truncated.
This works, but it is blocking, and you have the overhead from
partitioning...

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Gregory Stark 2008-04-10 09:02:10 Re: Concurrent psql API
Previous Message Csaba Nagy 2008-04-10 08:31:54 Re: [OT] Concurrent psql API