Re: no index-usage on aggregate-functions?

From: "Scott Marlowe" <smarlowe(at)qwest(dot)net>
To: "Harald Lau (Sector-X)" <harald(at)sector-x(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: no index-usage on aggregate-functions?
Date: 2004-06-29 17:21:58
Message-ID: 1088529718.12350.35.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, 2004-06-29 at 02:46, Harald Lau (Sector-X) wrote:
> @Chris:
>
> > > SELECT count(*) FROM the_table
> > > => Seq scan -> takes about 12 sec
> > This cannot be made O(1) in postgres due to MVCC. You just have to live
> > with it.
>
> bad news
> BTW: in this case you could workaround
> select reltuples from pg_class where relname='the_table'
> (yes, I know: presumes a regular vacuum analyse)

Note that there ARE other options. While the inability to provide a
speedy count is a "cost" of using an MVCC system, the ability to allow
thousands of readers to run while updates are happening underneath them
more than makes up for the slower aggregate performance.

The other options to this problem involve maintaining another table that
has a single (visible) row that is maintained by a trigger on the main
table that fires and updates that single row to reflect the count of the
table. This is costly on updates, but may be worth doing for certain
situations. Personally, I haven't had a great need to do a count(*) on
my tables that much. And on large tables, approximations are usually
fine.

> > Average and sum can never use an index AFAIK, in any db server. You
> > need information from every row.
>
> Take a look at the SQLSrv-pendant:
> create index x_1 on the_table (num_found)
> select avg(num_found) from the_table
> -> Index Scan(OBJECT:([midata].[dbo].[THE_TABLE].[x_1])
>
> (I'm not sure what Oracle does - have to re-install it first ...)

There's a good chance Oracle can use the index too. That's because both
Oracle is still a row locked database at heart. It's MVCC system sits
on top of it in roll back segments. So, the main store is serialized
and can be indexed, while the updates live in the rollback segment.

This, however, is not paradise. This limits Oracle's performance for
things like long running transactions and makes it slower as the amount
of information in the rollback segment grows. Meanwhile, PostgreSQL
uses an in store MVCC mechanism. This system means that all index
accesses must then hit the actual MVCC storage, since indexes aren't
easily serialized.

> @Scott:
> > Yes, you're expecting an MVCC database to behave like a row locking
> > database.
>
> hmmmm...
> So, it seems that PG is not soooo well suited for a datawarehouse and/or performing extensive statistics/calculations/reportings on large tables, is it?

On the contrary, it makes it GREAT for datawarehousing. Not because any
one child process will be super fast, but because ALL the child
processes will run reasonably fast, even under very heavy read and write
load. Note that if you've got the memory for the hash agg algo to fire
into shared memory, it's pretty darned fast now, so if the data (mostly)
fit into kernel cache you're gold. And 12 gig Intel boxes aren't that
expensive, compared to an Oracle license.

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Laurent Martelli 2004-06-29 17:30:29 Re: VidéoProj -> RMLL
Previous Message Mohan A 2004-06-29 16:54:59 suggestions to improve performace