Re: Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reviewing new index types (was Re: [PATCHES] Updated bitmap indexpatch)
Date: 2007-09-26 08:36:30
Message-ID: 200709260836.l8Q8aUd11470@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches


This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Sat, 2007-07-21 at 12:20 +0100, Simon Riggs wrote:
>
> > I'd like to help where I can if nobody else is currently doing this. I
> > would focus initially on some analysis of the various use cases to give
> > a better view on what we would need B-tree, clustered indexes and bitmap
> > indexes to do for us.
>
> I've done some further analysis of bitmap indexes in preparation for a
> comparison with clustered indexes (GIT), to help understand the use
> cases for each.
>
> Overall, my conclusion is that BMI and GIT have separate use cases,
> almost opposite use cases or at least orthogonal ones. I would
> eventually like both. BMI optimises for high numbers of rows per value,
> whilst GIT optimises for clustering of values. BMI is not useful at all
> for PKs, whilst GIT is specifically designed to handle them. Both handle
> INSERTs well, though GIT handles growing numbers of values easily, BMI
> prefers to keep the distribution more constant. GIT needs HOT to
> continue to operate effectively for long periods, whereas BMI doesn't
> seem to handle UPDATEs well at all (but more testing required on that
> one).
>
> ---
>
> Neither the latest bitmap index nor the latest GIT patch applied
> cleanly. The bitmap patch was close, but GIT needs an update yet to
> integrate Alexey's recent work.
>
> My test case was a table with 10 million rows, with columns with varying
> numbers of unique values. So Ndistinct = 100 means 100,000 rows per
> value.
>
> BITMAP INDEXES
>
> Ndistinct Best time Size in blocks
> 1 10.6s 100
> 10 10.4s 102
> 100 11.7s 2002
> 1000 15.1s 6006
> 10000 19.8s 10046
> 100000 82.1s 100442
> 1000000 - >450000
>
> Size exactly equivalent for both Integer and Text (same values). Build
> time was similar also.
>
> The test for 1 million distinct values didn't return after over 4 CPU
> minutes expended with the disk going crazy. After a number of minutes I
> decided to cancel the index build. Multiple cancels didn't stop the
> build, so after some more time I decided to kill it, which then crashed
> the server. Automatic restart crashed as well with a "could not find
> transaction id 0" error. Clearly some WAL-weirdness to investigate...
>
> Overall, I'd have to say that's quite enough for me to say bitmap is not
> quite ready yet without clear health warnings. I had hopes...
>
> B-TREE INDEXES (Integers)
>
> Rows/value Best time Size in blocks
> 10000000 49s 21899
> 1000000 49s 21899
> 100000 49s 21899
> 10000 47s 21899
> 1000 43s 21899
> 100 38s 21899
> 10 38s 21899
> 1 33s 21899
>
> Build time for Integers shown. Build time for Text ~x5-6 times as long.
>
> Testing against equivalent b-tree builds, the fastest b-tree build I
> could get was 33s on a unique integer index. So BMI build time is
> certainly optimised for low numbers of distinct values, but doesn't have
> any optimisation for when the BMI is built on a poor candidate column.
> GIT does degrade down to a normal b-tree when clustering isn't
> sufficient to give reduction in index size.
>
> The cross-over point was between 10^4 and 10^5 distinct values for both
> size and build time; on that test around 100-1000 rows per value. So
> BMIs are probably still useful with varying number of rows per value,
> but overall high Ndistinct proves inefficient in both build time and
> space allocation. This isn't such a surprise since we know that b-tree
> build uses a sort-based plan whereas BMI uses a hash based plan; neither
> will win all of the time, we know that from the executor.
>
> GIT works well even with unique indexes, since each grouped tuple covers
> a range of values. I'll re-run the tests when I can to get timings. GIT
> can compress typically down to 1-5% with clustered data, not quite as
> good as bitmap's 0.5% best.
>
> GIT's design was to have an index that was tuned for clustered data, yet
> degrades cleanly to a standard b-tree when conditions are not right.
> This makes me think that a hybrid b-tree should be possible, even
> desirable. When the data is clustered, use the grouping technique to
> reduce he number of tuples stored and when the data is highly non-unique
> use the bitmap technique to reduce numbers of tuples. Using both
> techniques in the same index would offer even wider flexibility, since
> we'd be able to cater for real-world data more easily. Both GIT and BMI
> use bitmaps, just in different ways.
>
> --
> Simon Riggs
> EnterpriseDB http://www.enterprisedb.com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2007-09-26 08:37:33 Re: pgcrypto & strong ciphers limitation
Previous Message Bruce Momjian 2007-09-26 08:34:27 Re: Postgresql.conf cleanup

Browse pgsql-patches by date

  From Date Subject
Next Message Bruce Momjian 2007-09-26 08:38:54 Re: allow CSV quote in NULL
Previous Message Bruce Momjian 2007-09-26 08:32:21 Re: Load Distributed Checkpoints, final patch