Re: On-disk bitmap index patch

From: "Jie Zhang" <jzhang(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
Subject: Re: On-disk bitmap index patch
Date: 2006-07-27 06:21:26
Message-ID: C0EDA976.90E7%jzhang@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/26/06 10:14 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
>> An obvious deduction is that the TPCH dataset is much more amenable to
>> run compression than my synthetic Zipfian data was. The interesting
>> question is how well "real" datasets are run compressable,
>
> Yeah --- the back-of-the-envelope calculations I was making presupposed
> uniform random distribution, and we know that's often not realistic for
> real datasets. A nonuniform distribution would probably mean that some
> of the bitmaps compress better-than-expected and others worse. I have
> no idea how to model that and guess what the overall result is ...
>

The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
Wu et al gave an approximate answer for this question. Assume that there are
c distinct values. Let the i-th value has a probability of p_i, the number
of rows r, and the word size w. then the total size of the compressed bitmap
index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
both \sum's, i is from 1 to c.

The constraint for this equation is \sum(p_i)=1. Therefore, when all p_i are
equal, or the attribute has randomly distributed values, the size of the
bitmap index is the largest.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2006-07-27 06:50:35 Re: On-disk bitmap index patch
Previous Message Tom Lane 2006-07-27 05:14:43 Re: On-disk bitmap index patch