On Clusters

From: Mark Kirkwood <markir(at)i4free(dot)co(dot)nz>
To: pgsql-sql(at)postgresql(dot)org
Subject: On Clusters
Date: 2001-03-07 05:57:47
Message-ID: 01030718574700.02422@spikey.slithery.org
Lists: pgsql-sql

A previous posting mentioning clusters prompted me to revisit some earlier
tests done on clustered and unclustered data.

It appears that currently ( 7.1beta5 ) the optimizer is unaware of any
clustering on a table - how important is that ?

To answer this question I used my "pet" data warehouse tables :

Table "fact1" 3000000 rows ~ 350Mb
Attribute | Type | Distribution
-----------+---------+-------------
d0key | integer | 3000 distinct values 0-9000 clustered
d1key | integer |
val | integer |
filler | text |
Index: fact1_pk ( d0key, d1key ) - cluster key

Table "fact2" 3000000 rows ~ 350Mb
Attribute | Type | Distribution
-----------+---------+-------------
d0key | integer | 3000 distinct values 0-9000 uniformly spread
d1key | integer |
val | integer |
filler | text |
Index: fact2_pk ( d0key, d1key )
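For anyone wanting to reproduce the setup, the layout above corresponds to
something like the following sketch ( only the table, column and index names
come from the post - the CLUSTER step for fact1 and the maintenance commands
are my assumptions ) :

```sql
-- Hypothetical reconstruction of the fact1 test table.
CREATE TABLE fact1 (
    d0key  integer,
    d1key  integer,
    val    integer,
    filler text
);

CREATE INDEX fact1_pk ON fact1 (d0key, d1key);

-- Physically order fact1 by its index ( 7.1-era syntax );
-- fact2 is built the same way but left unclustered.
CLUSTER fact1_pk ON fact1;

-- Refresh the statistics the optimizer works from.
VACUUM ANALYZE fact1;
```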

The sample queries used to shed some light on the nature of the difference
are : firstly the index scan -

explain select count(*)
from fact1 where d0key between 200 and 279;

Aggregate (cost=58664.62..58664.62 rows=1 width=0)
-> Index Scan using fact1_pk on fact1 (cost=0.00..58598.72 rows=26360 width=0)

and the sequential scan -

explain select count(*)
from fact1 where d0key between 200 and 280;

Aggregate (cost=59020.73..59020.73 rows=1 width=0)
-> Seq Scan on fact1 (cost=0.00..58954.00 rows=26693 width=0)

and analogous versions for fact2 ( with the same execution plan )

On the unclustered table fact2 the optimizer correctly assesses the point at
which to switch between an index scan and a sequential scan - both queries
take about 30 s.

However on the clustered table fact1, the (same) choice results in a jump
from 1 s for the index scan to 30 s for the sequential scan.

(this was the guts of the previous research... bear with me, those of you who
read the last article )

So how long should an index scan be used for ? Some experimentation led me to
adjust the "where" clause in my queries to

where d0key between 0 and 4500

This produces a query plan of :

Aggregate (cost=62692.75..62692.75 rows=1 width=0)
-> Seq Scan on fact1 (cost=0.00..58954.00 rows=1495498 width=0)

Coercing the optimizer with a brutal setting of cpu_tuple_cost = 0.4 gives :

Aggregate (cost=868673.82..868673.82 rows=1 width=0)
-> Index Scan using fact1_pk on fact1 (cost=0.00..864935.08 rows=1495498 width=0)

(note that these queries scan 1500000 rows, i.e. half the data )
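For clarity, the coercion amounts to a session-level SET before running
EXPLAIN ( the exact statement is my reconstruction; 0.4 is a 40x increase
over the usual default of 0.01 ) :

```sql
-- Make each tuple visited by a scan look far more expensive than the
-- default, which penalises the full sequential scan heavily.
SET cpu_tuple_cost = 0.4;

EXPLAIN SELECT count(*)
FROM fact1 WHERE d0key BETWEEN 0 AND 4500;
```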

Testing these queries on fact1 gives run times of about 35 s for both -

thus it is worthwhile to keep using index scans for up to 50%
of the ( clustered ) table data.

I found this result interesting, as I was thinking more like 15-20% of the
table data would be the limit.

The answer to the original question ( finally ) is "it is pretty important",
as knowledge of the clustering drastically changes the optimal access path.

So what to do if you know you have clustered data ? ( either via cluster or
"it just happens to go in that way" ).

Tampering with the various *_cost parameters to encourage index scans
seems to be the only solution ( other suggestions welcome here ), but tends to
be too global in effect ( for example, trying the previous query on the
( unclustered ) fact2 with cpu_tuple_cost = 0.4 takes more than 300 s - I got
tired of waiting... ).
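One way to contain that global effect is to change the parameter only around
the query that benefits from the clustering and then put it back - a sketch,
not a recommendation :

```sql
-- Encourage the index scan only where clustering makes it a win ...
SET cpu_tuple_cost = 0.4;
SELECT count(*) FROM fact1 WHERE d0key BETWEEN 0 AND 4500;

-- ... then restore the default so queries on the ( unclustered )
-- fact2 are planned normally again.
RESET cpu_tuple_cost;
```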

So be careful out there...

Cheers

Mark
