On Thu, Oct 9, 2008 at 6:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Given that PostgreSQL does the scan even with the huge seqscan
>> penalty, I can think of only two different causes:
>> 1) some sort of a bug in the query analyzer
>> 2) SELECT COUNT(DISTINCT x) for some reason requires information that
>> is not available in the index.
> Try (3) COUNT(DISTINCT x) ... or any DISTINCT aggregate for that matter
> ... is implemented by a sort-and-uniq step inside the aggregate function
> itself. You can't see it in the plan.
Does this mean that the sort-and-uniq will always lead to a full table
scan? It would seem so, because I could not force PostgreSQL to use
the index even with enable_seqscan set to off. I understand that in
some cases the table scan is faster, but it is very strange that the
query optimizer refuses to use the index no matter what.
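For reference, a minimal sketch of what I tried (table and column names
as elsewhere in this thread: a "log" table with a btree index on "code"):

```sql
-- Even with sequential scans penalized, the aggregate still seq-scans:
SET enable_seqscan = off;
EXPLAIN SELECT COUNT(DISTINCT code) FROM log;
-- The plan still shows a Seq Scan on log; the sort-and-uniq happens
-- inside the aggregate node, so the planner sees no use for the index.
```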
A quick calculation says that the table scan needs to access 32
million elements (and sort them, and uniq them). An index scan needs
only 2 million (or 4 million I suppose, if you account for the higher
levels in the B-tree). That is an order of magnitude difference in
disk reads alone.
> I wouldn't actually think that this approach would be slower than an
> indexscan, btw, unless maybe the index were very nearly correlated with
> physical order --- but that would make the sort more efficient, too.
I have ordered the data table according to the "code" column by first
importing it into a temp table and then building the actual "log" table
with CREATE TABLE log AS SELECT ... ORDER BY code;
In this case the index is faster. A lot faster. SELECT DISTINCT using
the index and counting the rows takes 4 minutes. Building a view with
distinct "code"s and counting that also takes 4 minutes. But the
aggregate scan takes 11 minutes. Worst of all, COUNT(DISTINCT ...) is
the way all SQL books tell you to do these kinds of queries.
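For what it's worth, the two formulations side by side (a sketch using
the table from this thread):

```sql
-- Textbook form: the DISTINCT is handled inside the aggregate
-- function, forcing a full scan plus a hidden sort-and-uniq:
SELECT COUNT(DISTINCT code) FROM log;

-- Equivalent rewrite: the DISTINCT becomes a plannable node in the
-- subquery, so an index on code can satisfy it:
SELECT COUNT(*) FROM (SELECT DISTINCT code FROM log) AS d;
```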
> Perhaps you need to raise work_mem enough to allow the sort to take
> place without spilling to disk? (Turning on trace_sort should let you
> see what's happening there.)
But isn't this just hiding the root cause? Having working sets larger
than available memory is not that uncommon.
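For completeness, the settings in question (session-local; the work_mem
value here is an assumption, to be sized to the actual data):

```sql
SET work_mem = '512MB';  -- per-sort-operation memory budget, this session only
SET trace_sort = on;     -- log whether each sort ran in memory or spilled to disk
SELECT COUNT(DISTINCT code) FROM log;
```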