Quick Links

Re: Sort time

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc:	pginfo <pginfo(at)t1(dot)unisoftbg(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Sort time
Date:	2002-11-17 20:56:13
Message-ID:	6813.1037566573@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> As a note, the same data loaded into a non-"C" locale database took about
> 42 seconds on the same machine, approximately 2.5x as long.

The non-C locale is undoubtedly the problem. I made a test dataset of
700000 all-alphabetic 20-character random strings:

$ head rand.data
duofoesrlycdnilvlcrg
crealjdrjpyczfbnlouo
lxaiyicslwjnxgpehtzp
ykizuovkvpkvvqsaocys
rkkvrqfiiybczwqdvvfu
stonxhbbvgwtjszodguv
prqxhwcfibiopjpiddud
ubgexbfdodhnauytebcf
urfoqifgbrladpssrwzw
ydcrsnxjpxospfqqoilw

I performed the following experiment in 7.3 using a database in
en_US locale, SQL_ASCII encoding:

enus=# create table vc20 (f1 varchar(20));
CREATE TABLE
enus=# \copy vc20 from rand.data
\.
enus=# vacuum analyze vc20;
VACUUM
enus=# set sort_mem to 50000;
SET
enus=# explain analyze select count(*) from
enus-# (select * from vc20 order by f1) ss;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=83607.48..83607.48 rows=1 width=24) (actual time=1058167.66..1058167.67 rows=1 loops=1)
-> Subquery Scan ss (cost=80107.48..81857.48 rows=700000 width=24) (actual time=1022972.86..1049559.50 rows=700000 loops=1)
-> Sort (cost=80107.48..81857.48 rows=700000 width=24) (actual time=1022972.80..1034036.58 rows=700000 loops=1)
Sort Key: f1
-> Seq Scan on vc20 (cost=0.00..12148.00 rows=700000 width=24) (actual time=0.20..24651.65 rows=700000 loops=1)
Total runtime: 1058220.10 msec
(6 rows)

(The point of the select count(*) was to avoid shipping the result rows
to the client, but in hindsight "explain analyze" would suppress that
anyway. But the main datapoint here is the time for the Sort step.)

I tried the test using datatype NAME as well, since it sorts using
plain strcmp() instead of strcoll():

enus=# create table nm (f1 name);
CREATE TABLE
enus=# insert into nm select f1 from vc20;
INSERT 0 700000
enus=# vacuum analyze nm;
VACUUM
enus=# set sort_mem to 50000;
SET
enus=# explain analyze select count(*) from
enus-# (select * from nm order by f1) ss;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=122701.48..122701.48 rows=1 width=64) (actual time=157877.84..157877.85 rows=1 loops=1)
-> Subquery Scan ss (cost=119201.48..120951.48 rows=700000 width=64) (actual time=121286.65..149376.93 rows=700000 loops=1)
-> Sort (cost=119201.48..120951.48 rows=700000 width=64) (actual time=121286.60..134075.61 rows=700000 loops=1)
Sort Key: f1
-> Seq Scan on nm (cost=0.00..15642.00 rows=700000 width=64) (actual time=0.21..24150.57 rows=700000 loops=1)
Total runtime: 157962.79 msec
(6 rows)

In C locale, the identical test sequence gives

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=83607.48..83607.48 rows=1 width=24) (actual time=187480.70..187480.71 rows=1 loops=1)
-> Subquery Scan ss (cost=80107.48..81857.48 rows=700000 width=24) (actual time=141100.03..178625.97 rows=700000 loops=1)
-> Sort (cost=80107.48..81857.48 rows=700000 width=24) (actual time=141099.98..162288.95 rows=700000 loops=1)
Sort Key: f1
-> Seq Scan on vc20 (cost=0.00..12148.00 rows=700000 width=24) (actual time=0.20..23954.71 rows=700000 loops=1)
Total runtime: 187565.79 msec
(6 rows)

and of course about the same runtime as before for datatype NAME. So on
this platform (HPUX 10.20), en_US locale incurs about a 6x penalty over
C locale for sorting varchars.

Note that NAME beats VARCHAR by a noticeable margin even in C locale,
despite the handicap of requiring much more I/O (being 64 bytes per row
not 24). This surprises me; it looks like varstr_cmp() is reasonably
well optimized in the C-locale case. But the real loser is VARCHAR in
non-C locales. I suspect the primary time sink is strcoll() not the
palloc/copy overhead in varstr_cmp(), but don't have time right now to
do profiling to prove it.

Anyway, use of NAME instead of VARCHAR might be a workable workaround
if you cannot change your database locale to C.

regards, tom lane

In response to

Re: Sort time at 2002-11-17 17:29:50 from Stephan Szabo

Responses

Re: Sort time at 2002-11-17 23:05:20 from Tom Lane
Re: Sort time at 2002-11-18 06:10:13 from pginfo

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Hannu Krosing	2002-11-17 21:10:12	Re: Sort time
Previous Message	Stephan Szabo	2002-11-17 17:29:50	Re: Sort time