Re: On-disk bitmap index patch

From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Jie Zhang <jzhang(at)greenplum(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Josh Berkus <josh(at)agliodbs(dot)com>, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org, Luke Lonergan <LLonergan(at)greenplum(dot)com>
Subject: Re: On-disk bitmap index patch
Date: 2006-07-28 17:17:14
Message-ID: 20060728171714.GR66525@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jul 27, 2006 at 09:13:21AM -0700, Jie Zhang wrote:
> On 7/26/06 11:50 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > "Jie Zhang" <jzhang(at)greenplum(dot)com> writes:
> >> On 7/26/06 10:14 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >>> ... A nonuniform distribution would probably mean that some
> >>> of the bitmaps compress better-than-expected and others worse. I have
> >>> no idea how to model that and guess what the overall result is ...
> >
> >> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
> >> Wu et al gave an approximate answer for this question. Assume that there are
> >> c distinct values. Let the i-th value has a probability of p_i, the number
> >> of rows r, and the word size w. then the total size of the compressed bitmap
> >> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
> >> both \sum's, i is from 1 to c.
> >
> > Hm, but that's still begging the question no? It's still assuming that
> > any one value is uniformly distributed. ISTM the cases that would break
> > my simplistic calculation involve clustering of particular values, such
> > that some areas of the bitmap are all-zero while other areas have lots
> > of ones.
>
> Yes, you are right -- each value is still uniformly distributed. But this
> will be the worst case in terms of the size of a bitmap vector. As for how
> to model the size of a bitmap vector for an non-uniformly distributed value,
> that's a good question. I don't really know. But we do know the best case
> and the worse case.

If the usefulness of bitmap indexes is still in doubt, could someone at
Greenplum provide data from actual data warehouses from actual
customers?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jim C. Nasby 2006-07-28 17:38:41 Re: The vacuum-ignore-vacuum patch
Previous Message Jim C. Nasby 2006-07-28 17:14:49 Re: Hash indexes (was: On-disk bitmap index patch)