Re: On-disk bitmap index patch

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jie Zhang" <jzhang(at)greenplum(dot)com>
Cc: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
Subject: Re: On-disk bitmap index patch
Date: 2006-07-27 06:50:35
Message-ID: 748.1153983035@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Jie Zhang" <jzhang(at)greenplum(dot)com> writes:
> On 7/26/06 10:14 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> ... A nonuniform distribution would probably mean that some
>> of the bitmaps compress better-than-expected and others worse. I have
>> no idea how to model that and guess what the overall result is ...

> The paper "Optimizing Bitmap Indices With Efficient Compression" by Kesheng
> Wu et al gave an approximate answer for this question. Assume that there are
> c distinct values. Let the i-th value has a probability of p_i, the number
> of rows r, and the word size w. then the total size of the compressed bitmap
> index is about (N/(w-1))(c- \sum(1-p_i)^(2w-2) - \sum(p_i)^(2w-2)), where in
> both \sum's, i is from 1 to c.

Hm, but that's still begging the question no? It's still assuming that
any one value is uniformly distributed. ISTM the cases that would break
my simplistic calculation involve clustering of particular values, such
that some areas of the bitmap are all-zero while other areas have lots
of ones.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Glaesemann 2006-07-27 07:49:31 Re: GUC with units, details
Previous Message Jie Zhang 2006-07-27 06:21:26 Re: On-disk bitmap index patch