Skip site navigation (1) Skip section navigation (2)

Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)
Date: 2007-07-23 21:02:31
Message-ID: 1185224551.4284.373.camel@ebony.site (view raw or flat)
Thread:
Lists: pgsql-hackerspgsql-patches
On Sat, 2007-07-21 at 12:20 +0100, Simon Riggs wrote:

> I'd like to help where I can if nobody else is currently doing this. I
> would focus initially on some analysis of the various use cases to give
> a better view on what we would need B-tree, clustered indexes and bitmap
> indexes to do for us.

I've done some further analysis of bitmap indexes in preparation for a
comparison with clustered indexes (GIT), to help understand the use
cases for each.

Overall, my conclusion is that BMI and GIT have separate use cases,
almost opposite use cases or at least orthogonal ones. I would
eventually like both. BMI optimises for high numbers of rows per value,
whilst GIT optimises for clustering of values. BMI is not useful at all
for PKs, whilst GIT is specifically designed to handle them. Both handle
INSERTs well, though GIT handles growing numbers of values easily, BMI
prefers to keep the distribution more constant. GIT needs HOT to
continue to operate effectively for long periods, whereas BMI doesn't
seem to handle UPDATEs well at all (but more testing required on that
one).

---

Neither the latest bitmap index nor the latest GIT patch applied
cleanly. The bitmap patch was close, but GIT needs an update yet to
integrate Alexey's recent work.

My test case was a table with 10 million rows, with columns with varying
numbers of unique values. So Ndistinct = 100 means 100,000 rows per
value.

BITMAP INDEXES

Ndistinct	Best time	Size in blocks
1		10.6s		100
10		10.4s		102
100		11.7s		2002
1000		15.1s		6006
10000		19.8s		10046
100000		82.1s		100442
1000000		-		>450000

Size exactly equivalent for both Integer and Text (same values). Build
time was similar also.

The test for 1 million distinct values didn't return after over 4 CPU
minutes expended with the disk going crazy. After a number of minutes I
decided to cancel the index build. Multiple cancels didn't stop the
build, so after some more time I decided to kill it, which then crashed
the server. Automatic restart crashed as well with a "could not find
transaction id 0" error. Clearly some WAL-weirdness to investigate...

Overall, I'd have to say that's quite enough for me to say bitmap is not
quite ready yet without clear health warnings. I had hopes...

B-TREE INDEXES (Integers) 

Rows/value	Best time	Size in blocks
10000000	49s		21899
1000000		49s		21899
100000		49s		21899
10000		47s		21899
1000		43s		21899
100		38s		21899
10		38s		21899
1		33s		21899

Build time for Integers shown. Build time for Text ~x5-6 times as long.

Testing against equivalent b-tree builds, the fastest b-tree build I
could get was 33s on a unique integer index. So BMI build time is
certainly optimised for low numbers of distinct values, but doesn't have
any optimisation for when the BMI is built on a poor candidate column.
GIT does degrade down to a normal b-tree when clustering isn't
sufficient to give reduction in index size.

The cross-over point was between 10^4 and 10^5 distinct values for both
size and build time; on that test around 100-1000 rows per value. So
BMIs are probably still useful with varying number of rows per value,
but overall high Ndistinct proves inefficient in both build time and
space allocation. This isn't such a surprise since we know that b-tree
build uses a sort-based plan whereas BMI uses a hash based plan; neither
will win all of the time, we know that from the executor.

GIT works well even with unique indexes, since each grouped tuple covers
a range of values. I'll re-run the tests when I can to get timings. GIT
can compress typically down to 1-5% with clustered data, not quite as
good as bitmap's 0.5% best.

GIT's design was to have an index that was tuned for clustered data, yet
degrades cleanly to a standard b-tree when conditions are not right.
This makes me think that a hybrid b-tree should be possible, even
desirable. When the data is clustered, use the grouping technique to
reduce he number of tuples stored and when the data is highly non-unique
use the bitmap technique to reduce numbers of tuples. Using both
techniques in the same index would offer even wider flexibility, since
we'd be able to cater for real-world data more easily. Both GIT and BMI
use bitmaps, just in different ways.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


In response to

Responses

pgsql-hackers by date

Next:From: Tom LaneDate: 2007-07-23 21:19:28
Subject: Re: Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)
Previous:From: Sibte AbbasDate: 2007-07-23 20:38:45
Subject: Re: [HACKERS] 8.2.4 signal 11 with large transaction

pgsql-patches by date

Next:From: Tom LaneDate: 2007-07-23 21:19:28
Subject: Re: Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)
Previous:From: Tom LaneDate: 2007-07-23 20:03:14
Subject: Re: COPYable logs

Privacy Policy | About PostgreSQL
Copyright © 1996-2014 The PostgreSQL Global Development Group