Re: Cluster using tablespaces?

From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Rainer Bauer <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Cluster using tablespaces?
Date: 2007-12-02 13:56:27
Message-ID: 20071202135627.GA6693@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Rainer Bauer wrote:
> Alvaro Herrera wrote:

> >It has been theorized that cluster would be faster in general if instead
> >of doing an indexscan we would instead use a seqscan + sort step. It
> >would be good to measure it.
>
> Could a reindex on the clustered index speed up the clustering (when executed
> immediatelly before the cluster command)? As I understand it, this index is
> used to fetch the table data in the correct order. Or is most of the time
> spend fetching the table data?

I haven't measured it, but my guess is that most of the time is in
fetching heap pages in random order.

> Also, would it make sense to increase <shared_buffers> for the cluster
> operation. This is set to 32MB here on my Windows box as was recommended.

Not sure. In general yes, but on Windows things are different.

> >> >For btree indexes, there is a temporary copy of the index data, which
> >> >will go wherever you have arranged for temp files to go. (I think that
> >> >easy user control of this may be new for 8.3, though.)
> >>
> >> Could you give me a hint where that would be on Windows? I guess this might be
> >> worth a try since there are a couple of btree indexes in the database.
> >
> >I think Tom is referring to the new temp_tablespaces config variable.
>
> I moved the pgsql_tmp directory to another disk, but that didn't speed up the
> cluster command.

Probably most of the time is going into creating the new table then.

If you are looking for a short-term solution to your problem, maybe the
best is to follow the recommendation on CLUSTER ref page:

There is another way to cluster data. The CLUSTER command
reorders the original table by scanning it using the index you
specify. This can be slow on large tables because the rows are
fetched from the table in index order, and if the table is
disordered, the entries are on random pages, so there is one
disk page retrieved for every row moved. (PostgreSQL has a
cache, but the majority of a big table will not fit in the
cache.) The other way to cluster a table is to use

CREATE TABLE newtable AS
SELECT * FROM table ORDER BY columnlist;

which uses the PostgreSQL sorting code to produce the desired
order; this is usually much faster than an index scan for
disordered data. Then you drop the old table, use ALTER TABLE
... RENAME to rename newtable to the old name, and recreate the
table's indexes. The big disadvantage of this approach is that
it does not preserve OIDs, constraints, foreign key
relationships, granted privileges, and other ancillary
properties of the table — all such items must be manually
recreated. Another disadvantage is that this way requires a sort
temporary file about the same size as the table itself, so peak
disk usage is about three times the table size instead of twice
the table size.

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ron Johnson 2007-12-02 18:22:50 Re: postgresql in ramdisk
Previous Message rokj 2007-12-02 13:35:39 One or more tables?