Re: Multiple column index usage question

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Multiple column index usage question
Date: 2007-01-19 23:47:48
Message-ID: 45B158A4.9040301@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

> Rather simple question, of which I'm not sure of the answer.
>
> If I have a multiple column index, say:
> Index index1 on tableA (foo,bar)
>
> and I then:
> Select * from "tableA" where foo = <some value>
>
> Will index1 be used, or am I looking at a seqscan in all circumstances?
>
> TIA
> -jan m

Rather difficult to answer this question without knowledge of the data,
especially it's statistical properties, but yes - the index might been
used if there's enough variability in the data (for the particular value).

But that's true for all indexes, single as well as multi-column ones.
For example imagine a query

Select * from "tableA" where foo = 'x';

where 'x' is a very uncommon value (for example less than 1% of the rows
has this value). In that case the index definitely will be used (unless
some really stupid mistake - for example different data types - prevents
it's usage). On the other side, imagine the value 'x' is very common
(for example more than 10% has this value). In that case it's very
unlikely the index will be used as the sequential scan of the whole
table will most likely be more efficient).

In the new releases (definitely 8.1, I'm not sure about 8.0) the index
might be used even for queries related to 'bar' column alone, though it
would be a little less efficient as for 'foo' (or even both columns).

All that means you can replace several single-column indexes with one
multi-column index, and still use that index for queries with only some
of the indexed columns, but there are differences in efficiency. Generally:

0) Multi-column indexes are most efficient when all the columns are
used in the query.

1) The more columns are used, the more efficient the index usage is.

2) Columns 'from the beginning' are more efficiently processed than
the columns 'from the end' (so the most often used column should
be placed at the beginning).

3) More variability in the data means more efficient index (so the
most variable columns should be placed at the beginning).

This is somehow contradictory, especially the rules (2) and (3), and you
have to reason (and test) carefully about the order in the index, as you
want place the most often queried at the beginning of the list, but
there may be columns with more variability.

Another thing you have to take into account is sorting - that's another
area of indexing, especially with multi-column indexes.

Well, somehow long answer for a relatively short question ... sorry for
that.

Tomas

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Holger Hoffstaette 2007-01-20 00:08:50 Re: Spam from EnterpriseDB?
Previous Message Ron Johnson 2007-01-19 23:33:47 Re: Multiple column index usage question