Re: BUG #4462: Adding COUNT to query causes massive slowdown

From: "Jussi Pakkanen" <jpakkane(at)gmail(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #4462: Adding COUNT to query causes massive slowdown
Date: 2008-10-10 19:55:10
Message-ID: 42d23b2e0810101255p59d5d1eyce79a0c917bd72ed@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Thu, Oct 9, 2008 at 6:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>> Given that PostgreSQL does the scan even with the huge seqscan
>> penalty, I can think of only two different causes:
>> 1) some sort of a bug in the query analyzer
>> 2) SELECT COUNT(DISTINCT x) for some reason requires information that
>> is not available in the index.
>
> Try (3) COUNT(DISTINCT x) ... or any DISTINCT aggregate for that matter
> ... is implemented by a sort-and-uniq step inside the aggregate function
> itself. You can't see it in the plan.

Does this mean that the sort-and-uniq will always lead to a full table
scan? It would seem so, because I could not force PostgreSQL to use
the index even with enable_seqscan set to off. I understand that in
some cases the table scan is faster, but it is very strange if the
query optimizer refuses to use the index no matter what.

A quick calculation says that the table scan needs to access 32
million elements (and sort them, and uniq them). An index scan needs
only 2 million (or 4 million I suppose, if you account for the higher
levels in the B-tree). That is an order of magnitude difference in
disk reads alone.

> I wouldn't actually think that this approach would be slower than an
> indexscan, btw, unless maybe the index were very nearly correlated with
> physical order --- but that would make the sort more efficient, too.

I have ordered the data table according to the "code" column by first
importing it to a temp table and then building the actual "log" table
with CREATE TABLE log AS SELECT ... ORDER BY code;.

In this case the index is faster. A lot faster. SELECT DISTINCT using
the index and counting the rows takes 4 minutes. Building a view with
distinct "code"s and counting that also takes 4 minutes. But the
aggregate scan takes 11 minutes. Worst of all, COUNT(DISTINCT ...) is
the way all SQL books tell you to do these kinds of queries.

> Perhaps you need to raise work_mem enough to allow the sort to take
> place without spilling to disk? (Turning on trace_sort should let you
> see what's happening there.)

But isn't this just hiding the root cause? Having working sets larger
than available memory is not that uncommon.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2008-10-10 21:07:00 Re: BUG #4462: Adding COUNT to query causes massive slowdown
Previous Message Peter Eisentraut 2008-10-10 16:56:30 Re: BUG #4465: GROUP BY is not to SQL standard