Re: On-disk bitmap index patch

From: "Jie Zhang" <jzhang(at)greenplum(dot)com>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: On-disk bitmap index patch
Date: 2006-07-27 05:09:34
Message-ID: C0ED989E.90E0%jzhang@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 7/26/06 8:55 PM, "Luke Lonergan" <llonergan(at)greenplum(dot)com> wrote:

> Tom,
>
> On 7/26/06 7:26 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> I wonder
>> whether they oughtn't use 16-bit instead of 8-bit HRL_WORDs
>
> We tried variations from 8-bit to 32-bit and came to the conclusion that
> 8-bit was a better match for the more general case. For the moment I forget
> exactly why (maybe Jie or Ayush will recall). It's a #define I believe, so
> you can play with it to find out.
>
> There's a lot more to be done with this - I think we've got some big gains
> ahead.

For low-cardinality cases, 8-bit word size is usually better than 16-bit and
32-bit. The reason is that in this case, the compression is better in 8-bit
than in 16-bit or 32-bit. But for high-cardinality cases, like 10,000,
16-bit is better. That's because a compressed 8-bit can only represent at
most 2^7*8=1024 bits. So every 10,000 bits requires about 10 bytes. However,
a compressed 16-bit can represent 2^15*16=524288 bits. We only need 2 bytes.

I have been thinking about this recently -- maybe we can support different
word sizes for different cardinalities.

>
> BTW - lots of excitement here at OSCON about bitmap index, and there's a
> fellow who is using the current Bizgres version, but wants *higher*
> cardinality support, so I discussed Jie's thoughts about implementing
> binning on values, basically combining bit vectors into value bins as the
> cardinality increases.

Yes, that's the basic idea. Essentially we want to limit the number of bins
into an ideal value so that (1) the size of the bitmap index can be small;
(2) this will still guarantee good query performance. The details still need
to be working out.

>
> I am still puzzled why our index creation time increases so dramatically as
> we approach cardinality 10,000. I know that we found that index page
> traversal for append began to take a lot of time as we started to increase
> the number of bitvector pages - we had talked about aggregating the append
> operations into groups before they were implemented to sequentialize the
> access.
>

I believe that this is because of 8-bit word size. We can try to increase it
to 16-bit, and see how the result looks like.

Thanks,
Jie

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2006-07-27 05:14:43 Re: On-disk bitmap index patch
Previous Message Tom Lane 2006-07-27 05:03:38 Re: GUC with units, details