Re: Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>, "Mark Cotner" <mcotner(at)yahoo(dot)com>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>
Subject: Re: Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables
Date: 2004-09-13 22:07:35
Message-ID: NOEFLCFHBPDAFHEIPGBOCEFPCEAA.simon@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


Mark,

I thought some additional comments on top of Christopher's excellent notes
might help you.

> Christopher Browne
> The world rejoiced as Mischa Sandberg
> <ischamay(dot)andbergsay(at)activestateway(dot)com> wrote:
> > Mark Cotner wrote:
> >> Requirements:
> >> Merge table definition equivalent. We use these
> >> extensively.
>
> > Looked all over mysql.com etc, and afaics merge table is indeed
> > exactly a view of a union-all. Is that right?
>

PostgreSQL's functionality is in many ways similar to Oracle Partitioning.

Loading up your data in many similar tables, then creating a view like:

CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS
SELECT 200409130800, col1, col2, col3... FROM table200409130800
UNION ALL
SELECT 200409131000, col1, col2, col3... FROM table200409131000
UNION ALL
SELECT 200409131200, col1, col2, col3... FROM table200409131200
...etc...

will allow the PostgreSQL optimizer to eliminate partitions from the query
when you run queries which include a predicate on the partitioning_col, e.g.

select count(*) from bigtable where idate >= 200409131000

will scan the last two partitions only...

There are a few other ways of creating the view that return the same answer,
but only using constants in that way will allow the partitions to be
eliminated from the query, and so run for much longer.

So you can give different VIEWS to different user groups, have different
indexes on different tables etc.

However, I haven't managed to get this technique to work when performing a
star join to a TIME dimension table, since the parition elimination relies
on comparison of constant expressions. You'll need to check out each main
join type to make sure it works for you in your environment.

> > PG supports views, of course, as well (now) as tablespaces, allowing
> > you to split tables/tablesets across multiple disk systems. PG is
> > also pretty efficient in query plans on such views, where (say) you
> > make one column a constant (identifier, sort of) per input table.
>
> The thing that _doesn't_ work well with these sorts of UNION views are
> when you do self-joins. Supposing you have 10 members, a self-join
> leads to a 100-way join, which is not particularly pretty.
>

Well, that only happens when you forget to include the partitioning constant
in the self join.

e.g. select count(*) from bigtable a, bigtable b where a.idate =
.idate; --works just fine

The optimizer really is smart enough to handle that too, but I'm sure such
large self-joins aren't common for you anyhow.

> I'm quite curious as to how MySQL(tm) copes with this, although it may
> not be able to take place; they may not support that...
>

It doesn't, AFAIK.

> Christopher Browne wrote
> A long time ago, in a galaxy far, far away, mcotner(at)yahoo(dot)com
> (Mark Cotner) wrote:
> > Agreed, I did some preliminary testing today and am very impressed.
> > I wasn't used to running analyze after a data load, but once I did
> > that everything was snappy.
>
> Something worth observing is that this is true for _any_ of the
> database systems supporting a "cost-based" optimization system,
> including Oracle and DB2.

Agreed. You can reduce the time for the ANALYZE by ignoring some of the
(measures) columns not used in WHERE clauses.

Also, if you're sure that each load is very similar to the last, you might
even consider directly updating pg_statistic rows with the statistical
values produced from an earlier ANALYZE...scary, but it can work.

To create a set of tables of > 600Gb, you will benefit from creating each
table WITHOUT OIDS.

Hope some of that helps you...

Best Regards, Simon Riggs

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Stephen Crowley 2004-09-14 00:51:22 Large # of rows in query extremely slow, not using index
Previous Message Tom Lane 2004-09-13 21:18:26 Re: Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options