Re: Unacceptable postgres performance vs. Microsoft sqlserver

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Ivan Sergio Borgonovo" <mail(at)webthatworks(dot)it>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Unacceptable postgres performance vs. Microsoft sqlserver
Date: 2008-04-14 20:30:15
Message-ID: 87r6d8faco.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

"Ivan Sergio Borgonovo" <mail(at)webthatworks(dot)it> writes:

> But why once you add the index and count distinct the performances
> are still so far?
> I'd say that counting in this case is not the hardest thing to do,
> but rather the "distinct" part.

Your tests have been a bit unfortunate in finding a few particularly soft bits
in Postgres's underbelly.

a) Postgres can't do "index-only" scans so the index doesn't really help in
this case. There's some discussion about improving that soon but it's still
early.

b) Postgres doesn't know how to pick just distinct values out of an index, it
has to find a start point and read all the records from that point forward
so even if this is a very low cardinality value it would have to read the
whole index.

c) DISTINCT is one of the earlier features in Postgres and a lot planner code
is a lot smarter. In particular it doesn't know about using hash tables to
find distinct values. This you can work around by rewriting the query as
GROUP BY goes through a more recent code path.

d) The default settings if you haven't tuned postgresql.conf are quite
conservative. That alone often makes it look poor when compared against
other databases. In particular it makes it unlikely to pick hash tables for
things or do in-memory sorts unless you raise work_mem.

This is what you're doing now:

postgres=# explain select count(distinct aid) from accounts;
QUERY PLAN
----------------------------------------------------------------------
Aggregate (cost=2838.00..2838.01 rows=1 width=4)
-> Seq Scan on accounts (cost=0.00..2588.00 rows=100000 width=4)
(2 rows)

This is what you're expecting it to do (on a freshly built clustered index it
might perform ok but on a non-clustered index it will be terrible):

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Aggregate (cost=5475.26..5475.27 rows=1 width=0)
-> Group (cost=0.00..4225.26 rows=100000 width=4)
-> Index Scan using accounts_pkey on accounts (cost=0.00..3975.26 rows=100000 width=4)
(3 rows)

postgres=# set work_mem = '128M';
ERROR: invalid value for parameter "work_mem": "128M"
HINT: Valid units for this parameter are "kB", "MB", and "GB".

(ARGH! DAMN YOU PETER!!!!)

This is probably the best you can get Postgres to do currently:

postgres=# set work_mem = '128MB';
SET

postgres=# explain select count(*) from (select 1 from accounts group by aid) as a;
QUERY PLAN
----------------------------------------------------------------------------
Aggregate (cost=5088.00..5088.01 rows=1 width=0)
-> HashAggregate (cost=2838.00..3838.00 rows=100000 width=4)
-> Seq Scan on accounts (cost=0.00..2588.00 rows=100000 width=4)
(3 rows)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ivan Sergio Borgonovo 2008-04-14 21:36:14 Re: Unacceptable postgres performance vs. Microsoft sqlserver
Previous Message Andreas 'ads' Scherbaum 2008-04-14 20:04:24 Re: Unacceptable postgres performance vs. Microsoft sqlserver