Here's the design of bitmap AM I'm planning to implement. I've discussed it
with Neil, but I'm willing to get more feedback on it.
There are going to be 1 metapage, 1 list of CTIDs (LOC), one list
of attribute values (LOV, including attributes for multi-column indexes) and a
bitmap for each entry in the LOV. Metapage will contain pointers to the LOC,
LOV will always start at page-1 and is organized as a 1-way chained list.
Neil suggested a very good way how to handle updates. Of course, it's not
necessary to strictly follow tuple layout in the table's heap, as I wanted
to do initially. All that's needed, is that bit positions in bitmaps would
be tied with CTID positions in LOC.
So, we can do simple insert (generally, that's how MVCC works for tuple
updates): when some tuple is updated, new CTID is added to the LOC and each
bitmap is extended by 1 bit.
On the other side (as Neil pointed), after VACUUM, we need to do some
maintenance of bitmap index to avoid situations when index contains duplicate
entries (in LOC and bitmaps) for the same CTID (value before marking tuple for
reuse and after actually reusing it). So far, the fastest way would be
recreating index, because the whole index designed for faster search/insert
operations and lacks effective reverse mapping (CTID->bit position->bit value)
List of CTIDs is organized into an array of extents. Each extent has 2**i
pages ('i' is extent number in array). All pages for extent are allocated at
once, ensuring their IDs are sequential. So, we need to store only
BufferNumber of the first page in extent. LOC pages contain array of
ItemPointerData, it's size is detected at compile time. So, CTID for given bit
position can be obtained via only one ReadBuffer() call.
LOV pages store arrays of value-descriptors, each descriptor has a pointer to
the head of value's bitmap. Bitmaps are organized as 1-way chained list,
bitmap contents is compressed using Word-Aligned Hybryd method (similar to
Neil suggested the other way of organizing bitmap storage: all bits for
given position are stored in one page (if it lacks space, new page is added
"at the bootom"), so we'll have a 2-dimensional bitmap storage.
This reduces amount of page reads during index scan, but for low-cardinality
indexes, we'll waste a lot of space per page, if each CTIDs slice is stored in
one page. On the other hand, it'll be hard to extend bitmap if we'll store
several CTID slices per page and some new value will be inserted (i.e. CTID
slice needs to be increased).
At the moment, I would implement the easiest method -- equality encoding (as
it's called in papers, bitmap per value). Neil's suggested techniques are
called range encoding. Josh Berkus on the irc suggested implementing several
encoding schemes as an option to the "create index" sql command.
What do you think?
I haven't looked at planner/executor yet.
If you have some questions, please, ask. Also, I'd like to tell that this is
my first time coding for PostgreSQL, thus I may use incorrect terminology.
Also, it takes much time to write anything, for the same reason. And yes,
I would really appreciate all kind of criticism on this design.
I've started to implement AM, I need to register am* functions, what OIDs
should use to register them in include/catalog/pg_proc.h?
Waiting for your feedback.
Victor Y. Yegorov
pgsql-hackers by date
|Next:||From: Tom Lane||Date: 2005-02-28 19:51:46|
|Subject: Re: snprintf causes regression tests to fail |
|Previous:||From: Nicolai Tufar||Date: 2005-02-28 19:42:37|
|Subject: Re: snprintf causes regression tests to fail|