Re: Using quicksort for every external sort run

From: Greg Stark <stark(at)mit(dot)edu>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-09 02:39:13
Message-ID: CAM-w4HPE8g6mF11DGQ0Fut2J5xj5_9WFu4Lf1SnM=xSQdgv4=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>
> Then in the next (final) merge, it is has to read in this huge
> fragmented tape run emulation, generating a lot of random IO to read
> it.

This seems fairly plausible. Logtape.c is basically implementing a
small filesystem and doesn't really make any attempt to avoid
fragmentation. The reason it does this is so that we can reuse blocks
and avoid needing to store 2x disk space for the temporary space. I
wonder if we're no longer concerned about keeping the number of tapes
down if it makes sense to give up on this goal too and just write out
separate files for each tape letting the filesystem avoid
fragmentation. I suspect it would also be better for filesystems like
ZFS and SSDs where rewriting blocks can be expensive.

> With the patched code, the average length of reads on files in
> pgsql_tmp between lseeks or changing to a different file descriptor is
> 8, while in the unpatched code it is 14.

I don't think Peter did anything to the scheduling of the merges so I
don't see how this would be different. It might just have hit a
preexisting case by changing the number and size of tapes.

I also don't think the tapes really ought to be so unbalanced. I've
noticed some odd things myself -- like what does a 1-way merge mean
here?

LOG: finished writing run 56 to tape 2 (9101313 blocks): CPU
0.19s/10.97u sec elapsed 16.68 sec
LOG: finished writing run 57 to tape 3 (9084929 blocks): CPU
0.19s/11.14u sec elapsed 19.08 sec
LOG: finished writing run 58 to tape 4 (9101313 blocks): CPU
0.20s/11.31u sec elapsed 19.26 sec
LOG: performsort starting: CPU 0.20s/11.48u sec elapsed 19.44 sec
LOG: finished writing run 59 to tape 5 (9109505 blocks): CPU
0.20s/11.49u sec elapsed 19.44 sec
LOG: finished writing final run 60 to tape 6 (8151041 blocks): CPU
0.20s/11.55u sec elapsed 19.50 sec
LOG: finished 1-way merge step (1810433 blocks): CPU 0.20s/11.58u sec
elapsed 19.54 sec <-------------------------=========
LOG: finished 10-way merge step (19742721 blocks): CPU 0.20s/12.23u
sec elapsed 20.19 sec
LOG: finished 13-way merge step (23666689 blocks): CPU 0.20s/13.15u
sec elapsed 21.11 sec
LOG: finished 13-way merge step (47333377 blocks): CPU 0.22s/14.07u
sec elapsed 23.13 sec
LOG: finished 14-way merge step (47333377 blocks): CPU 0.24s/15.65u
sec elapsed 24.74 sec
LOG: performsort done (except 14-way final merge): CPU 0.24s/15.66u
sec elapsed 24.75 sec

I wonder if something's wrong with the merge scheduling.

Fwiw attached are two patches for perusal. One is a trivial patch to
add the size of the tape to trace_sort output. I guess I'll just apply
that without discussion. The other replaces the selection sort with an
open coded sort network for cases up to 8 elements. (Only in the perl
generated qsort for the moment). I don't have the bandwidth to
benchmark this for the moment but if anyone's interested in trying I
suspect it'll make a small but noticeable difference. I'm guessing
2-5%.

--
greg

Attachment Content-Type Size
0001-Add-logical-tape-size-to-TRACE_SORT-output.patch text/x-patch 2.1 KB
0002-replace-selection-sort-base-case-with-optimal-open-c.patch text/x-patch 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2015-12-09 02:44:29 Re: Using quicksort for every external sort run
Previous Message Michael Paquier 2015-12-09 02:29:31 Re: PostgresNode::_update_pid using undefined variables in tap tests