Re: How slow is distinct - 2nd

From: Michael Contzen <mcontzen(at)dohle(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Cc: Bruno Wolff III <bruno(at)wolff(dot)to>
Subject: Re: How slow is distinct - 2nd
Date: 2002-10-07 12:21:04
Message-ID: 3DA17C30.70D81D14@dohle.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Bruno Wolff III schrieb:
>
> On Tue, Oct 01, 2002 at 14:18:50 +0200,
> Michael Contzen <Michael(dot)Contzen(at)dohle(dot)com> wrote:
> > Here the table:
> >
> > mc=# \d egal
> > Table "public.egal"
> > Column | Type | Modifiers
> > --------+---------+-----------
> > i | integer |
> >
> > mc=# select count(*) from egal;
> > count
> > ---------
> > 7227744
> > (1 row)
> >
> > mc=# select count(distinct i) from egal;
> > count
> > -------
> > 67
> > (1 row)
>
> This suggests that the best way to do this is with a hash instead of a sort.
>
> If you have lots of memory you might try increasing the sort memory size.

Hello,

ok, sort_mem was still set to the default (=1024). I've increased it to
sort_mem=10240
which results to: (same machine, same data, etc.)

time echo "select distinct i from egal;"|psql mc >/dev/null

real 2m30.667s
user 0m0.000s
sys 0m0.010s

If I set sort_mem=1024000:

time echo "select distinct i from egal;"|psql mc >/dev/null

real 0m52.274s
user 0m0.020s
sys 0m0.000s

wow, in comparison to nearly 5 minutes before this is quite good
speedup.

But:

All the work could be done in memory as the processor load shows (output
of top, which shows the following output during all the time)

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
8310 postgres 17 0 528M 528M 2712 R 99.9 13.5 0:11 postmaster

Even it nearly performs 5 times faster than before with 1M memory,
postgres is still
8 times slower than oracle. Further increasing of sort_mem to 4096000
doesn't reduce the
time, as the cpu load cannot increased any more :-)

But increasing the memory in that way is not realy a solution: Normaly
not all the data
fits into memory. In our application I guess 10%.

Oracle has even less memory:

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
8466 oracle 14 0 11052 10M 10440 R 99.9 0.2 0:01 oracle

(this 10M session memory plus 32M shared memory pool not shown here).

This shows to me, that oracle uses a quite different algorithm for this
task. May be it
uses some hashing-like algorithm first without sorting before. I don't
know oracle enough, perhaps this is that "sort unique" step in the
planners output.
I think, first Postgres sorts all the data, which results to temporary
data of the same
size than before and which needs to be written to disk at least once,
and after that postgres does the unique operation, right?

If I can do any more tests to oracle or postgres, let me know.

Kind regards,

Michael Contzen

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Larry Rosenman 2002-10-07 12:25:08 Re: timestamp
Previous Message Larry Rosenman 2002-10-07 12:04:33 Re: [HACKERS] cross-posts (was Re: Large databases,