Re: Indexing a Boolean or Null column?

From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Indexing a Boolean or Null column?
Date: 2004-01-04 21:39:37
Message-ID: m3brpjjsva.fsf@wolfe.cbbrowne.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

After a long battle with technology, dante(at)lorenso(dot)com ("D. Dante Lorenso"), an earthling, wrote:
> I've been debating with a collegue who argues that indexing a
> boolean column is a BAD idea and that is will actually slow
> down queries.

No, it would be expected to slow down inserts, but not likely queries.

> Will an index on the 'data_is_active' column be used or work
> as I expect? I'm assuming that I may have a million entries
> sharing the same 'data_lookup_key' and I'll be using that to
> search for the active version of the row.

> SELECT *
> FROM table
> WHERE data_lookup_key = 'pear'
> AND data_is_active IS TRUE;
>
> Does it make sense to have an index on data_is_active?

Not really.

> Now, I've read that in some databases the index on a column that has
> relatively even distribution of values over a small set of values
> will not be efficient.

The problem is (and this is likely to be true for just about any
database system that is 'page-based,' which is just about any of them,
these days) that what happens, with the elements being so pervasive,
throughout the table, queries will be quite likely to hit nearly every
page of the table.

If you're hitting practically every page, then it is more efficient to
just walk thru the pages (Seq Scan) rather than to bother reading the
index.

The only improvement that could (in theory) be made is to cluster all
the "true" values onto one set of pages, and all the "false" ones onto
another set of pages, and have a special sort of index that knows
which pages are "true" and "false". I _think_ that Oracle's notion of
"cluster tables" function rather like this; it is rather debatable
whether it would be worthwhile to do similar with PostgreSQL.

A way of 'clustering' with PostgreSQL might be to have two tables
table_active
and
table_inactive
where a view, representing the 'join' of them, would throw in the
'data_is_active' value. By clever use of some rules/triggers, you
could insert into the view, and have values get shuffled into the
appropriate table.

When doing a select on the view, if you asked for "data_is_active is
TRUE", the select would only draw data from table_inactive, or
vice-versa.

Unfortunately, sometimes the query optimizer may not be clever enough
when working with the resulting joins, though that may just be a
Simple Matter Of Programming to make it more clever in future versions.
:-)
--
output = reverse("gro.gultn" "@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/spreadsheets.html
Rules of the Evil Overlord #136. "If I build a bomb, I will simply
remember which wire to cut if it has to be deactivated and make every
wire red." <http://www.eviloverlord.com/>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message John Siracusa 2004-01-04 23:36:16 Use my (date) index, darn it!
Previous Message Tom Lane 2004-01-04 07:48:18 Re: Indexing a Boolean or Null column?