Quick Links

Re: Sort time

From:	pginfo <pginfo(at)t1(dot)unisoftbg(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Sort time
Date:	2002-11-18 06:10:13
Message-ID:	3DD88445.F617D9B@t1.unisoftbg.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Hi Tom,

The idea is very good.
I recreated the tables and for all IDS_xxx I used name (not varchar(20)).
The the is also in unicode.
I ran the query and got huge improvement!
The work time is 166 sek. ( before it was ~320 - 340 sek.).

I will continue to make new tests and play around the setups.
I think all this can be more quicker. I expect to get ~ 45-60 sek. ( this is the time in oracle), but laso 166 sek is good.

I think that we need to work around the non us sorting and compare.
It is not possible to be so slow (all the functions are executed in memory
and in java and by oracle and by ms all this is working very fast).

regards,
ivan.

Tom Lane wrote:

> Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> > As a note, the same data loaded into a non-"C" locale database took about
> > 42 seconds on the same machine, approximately 2.5x as long.
>
> The non-C locale is undoubtedly the problem. I made a test dataset of
> 700000 all-alphabetic 20-character random strings:
>
> $ head rand.data
> duofoesrlycdnilvlcrg
> crealjdrjpyczfbnlouo
> lxaiyicslwjnxgpehtzp
> ykizuovkvpkvvqsaocys
> rkkvrqfiiybczwqdvvfu
> stonxhbbvgwtjszodguv
> prqxhwcfibiopjpiddud
> ubgexbfdodhnauytebcf
> urfoqifgbrladpssrwzw
> ydcrsnxjpxospfqqoilw
>
> I performed the following experiment in 7.3 using a database in
> en_US locale, SQL_ASCII encoding:
>
> enus=# create table vc20 (f1 varchar(20));
> CREATE TABLE
> enus=# \copy vc20 from rand.data
> \.
> enus=# vacuum analyze vc20;
> VACUUM
> enus=# set sort_mem to 50000;
> SET
> enus=# explain analyze select count(*) from
> enus-# (select * from vc20 order by f1) ss;
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=83607.48..83607.48 rows=1 width=24) (actual time=1058167.66..1058167.67 rows=1 loops=1)
> -> Subquery Scan ss (cost=80107.48..81857.48 rows=700000 width=24) (actual time=1022972.86..1049559.50 rows=700000 loops=1)
> -> Sort (cost=80107.48..81857.48 rows=700000 width=24) (actual time=1022972.80..1034036.58 rows=700000 loops=1)
> Sort Key: f1
> -> Seq Scan on vc20 (cost=0.00..12148.00 rows=700000 width=24) (actual time=0.20..24651.65 rows=700000 loops=1)
> Total runtime: 1058220.10 msec
> (6 rows)
>
> (The point of the select count(*) was to avoid shipping the result rows
> to the client, but in hindsight "explain analyze" would suppress that
> anyway. But the main datapoint here is the time for the Sort step.)
>
> I tried the test using datatype NAME as well, since it sorts using
> plain strcmp() instead of strcoll():
>
> enus=# create table nm (f1 name);
> CREATE TABLE
> enus=# insert into nm select f1 from vc20;
> INSERT 0 700000
> enus=# vacuum analyze nm;
> VACUUM
> enus=# set sort_mem to 50000;
> SET
> enus=# explain analyze select count(*) from
> enus-# (select * from nm order by f1) ss;
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=122701.48..122701.48 rows=1 width=64) (actual time=157877.84..157877.85 rows=1 loops=1)
> -> Subquery Scan ss (cost=119201.48..120951.48 rows=700000 width=64) (actual time=121286.65..149376.93 rows=700000 loops=1)
> -> Sort (cost=119201.48..120951.48 rows=700000 width=64) (actual time=121286.60..134075.61 rows=700000 loops=1)
> Sort Key: f1
> -> Seq Scan on nm (cost=0.00..15642.00 rows=700000 width=64) (actual time=0.21..24150.57 rows=700000 loops=1)
> Total runtime: 157962.79 msec
> (6 rows)
>
> In C locale, the identical test sequence gives
>
> QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=83607.48..83607.48 rows=1 width=24) (actual time=187480.70..187480.71 rows=1 loops=1)
> -> Subquery Scan ss (cost=80107.48..81857.48 rows=700000 width=24) (actual time=141100.03..178625.97 rows=700000 loops=1)
> -> Sort (cost=80107.48..81857.48 rows=700000 width=24) (actual time=141099.98..162288.95 rows=700000 loops=1)
> Sort Key: f1
> -> Seq Scan on vc20 (cost=0.00..12148.00 rows=700000 width=24) (actual time=0.20..23954.71 rows=700000 loops=1)
> Total runtime: 187565.79 msec
> (6 rows)
>
> and of course about the same runtime as before for datatype NAME. So on
> this platform (HPUX 10.20), en_US locale incurs about a 6x penalty over
> C locale for sorting varchars.
>
> Note that NAME beats VARCHAR by a noticeable margin even in C locale,
> despite the handicap of requiring much more I/O (being 64 bytes per row
> not 24). This surprises me; it looks like varstr_cmp() is reasonably
> well optimized in the C-locale case. But the real loser is VARCHAR in
> non-C locales. I suspect the primary time sink is strcoll() not the
> palloc/copy overhead in varstr_cmp(), but don't have time right now to
> do profiling to prove it.
>
> Anyway, use of NAME instead of VARCHAR might be a workable workaround
> if you cannot change your database locale to C.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org

In response to

Re: Sort time at 2002-11-17 20:56:13 from Tom Lane

Responses

Re: Sort time at 2002-11-18 10:16:14 from pginfo
Re: Sort time at 2002-11-18 11:56:50 from Stephan Szabo

Browse pgsql-performance by date

	From	Date	Subject
Next Message	pginfo	2002-11-18 10:16:14	Re: Sort time
Previous Message	Tom Lane	2002-11-17 23:54:42	Re: Sort time