Re: Why does a simple query not use an obvious index?

From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Scott Marlowe" <smarlowe(at)qwest(dot)net>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Jack Kerkhof" <jack(dot)kerkhof(at)guest-tek(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Why does a simple query not use an obvious index?
Date: 2004-08-29 22:10:05
Message-ID: 874qmlk342.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance


"Scott Marlowe" <smarlowe(at)qwest(dot)net> writes:

> PostgreSQL has a "generic" aggregate method. Imagine instead doing a
> select count(id1+id2-id3) from table where ... In that instance, it's
> not a simple shortcut to just grab the number of rows anymore. Since
> PostgreSQL uses a generic aggregate method that can be expanded by the
> user with custom aggregates et. al., it has no optimizations to make
> simple count(*) fast, like many other databases.

People expect count(*) _without a where clause_ to be cached in a single
global variable. Postgres can't do this, but the reason has everything to do
with MVCC, not with postgres's generalized aggregates. Because of MVCC
Postgres can't just store a single cached value, because there is no single
cached value. It would have to store a complete history back to the oldest
extant transaction.

However in this case the user has a where clause. No database is going to
cache values of count(*) for random where clauses. But that doesn't stop
Postgres from using an index to fetch the records.

> > > select somefield from sometable where timestampfield > now()-'60
> > > seconds'::interval
> > >
> > > and count the number of returned rows. If there's a lot, it won't be
> > > any faster, if there's a few, it should be a win.
> >
> > Why would this ever be faster? And how could postgres ever calculate that
> > without doing a sequential scan when count(*) would force it to do a
> > sequential scan?
>
> Because, count(*) CANNOT use an index. So, if you're hitting, say,
> 0.01% of the table (let's say 20 out of 20,000,000 rows or something
> like that) then the second should be MUCH faster.

I think you've applied these past discussions and come up with some bogus
conclusions.

The problem here has nothing to do with the count(*) and everything to do with
the WHERE clause. To fetch the records satisfying that where clause postgres
has to do exactly the same thing regardless of whether it's going to feed the
data to count(*) or return some or all of it to the client.

If postgres decides the where clause isn't selective enough it'll choose to
use a sequential scan. However it would do that regardless of whether you're
calling count(*) or not. If the number is records is substantial then you
would get the overhead of the scan plus the time it takes to transfer all that
unnecessary data to the user.

What you're probably thinking of when you talk about general purpose aggregate
interfaces is the difficulty of making min()/max() use indexes. That's a whole
other case entirely. That's where postgres's generalized aggregates leaves it
without enough information about what records the aggregate functions are
interested in and what index scans might make them faster.

None of these common cases end up making it a good idea to read the records
into the clients and do the work in the client. The only cases where that
would make sense would be if the function requires doing some manipulation of
the data that's awkward to express in sql. The "top n" type of query is the
usual culprit, but with postgres's new array functions even that becomes
tractable.

--
greg

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Rod Taylor 2004-08-29 22:18:20 Re: Why does a simple query not use an obvious index?
Previous Message Steinar H. Gunderson 2004-08-29 22:04:13 Re: Why does a simple query not use an obvious index?