Re: Bitmap index thoughts

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Jie Zhang" <jzhang(at)greenplum(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bitmap index thoughts
Date: 2006-12-27 19:24:46
Message-ID: 1167247486.3783.470.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 2006-12-27 at 22:16 +1100, Gavin Sherry wrote:

> > But actually I'm not convinced we need to worry about efficient storage
> > of small bitmaps at all. The typical use case for bitmap indexes is
> > large tables with small number of distinct values, and the problem
> > doesn't really arise in that scenario. Let's keep it simple for now, we
> > can enhance it in later releases.
>
> The scenario I'm concerned about is where a sales data base, say, has
> 100,000 products. However, only 500 or 1000 products are popular. They
> dominate, say >99% of the sales. The other 99,900 products consume a
> little bit over 8K each for very little benefit :-(.
>
> This is pretty contrived but it seem real world enough...

Well, that seems to me to be the typical case. It's called a Zipf
Distribution and applies to categorical data everywhere.

The main question is how you design your database. We might think of
indexing the product group (a higher level of the Product Dimension),
since these tend to have a low number of values but these may have been
normalised (or "placed into a Dimension table"). This might leave you
with only the ProductId values in the large Fact table, which as Gavin
points out, may be sparsely populated.

Another idea might to store the first N heap pointers in the auxiliary
heap, rather than allocating them a whole page for the first value. That
would be even more space efficient than allocating a fixed size part of
a page. At least that would provide some utility for that part of the
storage mechanism.

However, I think Heikki's KISG approach seems good for now. The
infrequent values will probably be infrequently accessed, so we may
never need to read them at all (wishful thinking?).

If we can get something ready for commit soon, that will leave lots of
time before 8.3 freeze to measure where things can be further improved
in terms of space and performance.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2006-12-27 19:30:10 pgindent infelicity
Previous Message Tom Lane 2006-12-27 19:15:02 Re: [BUGS] BUG #2846: inconsistent and confusing handling of