Re: Using quicksort for every external sort run

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-13 22:43:55
Message-ID: CAMkU=1wcoiY7LaV=E8YmtzfG40Y8Q-VNCqKDGvAqGnvpQX_EmA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>>
>> Then in the next (final) merge, it has to read in this huge
>> fragmented tape run emulation, generating a lot of random IO to read
>> it.
>
> This seems fairly plausible. Logtape.c is basically implementing a
> small filesystem and doesn't really make any attempt to avoid
> fragmentation. The reason it does this is so that we can reuse blocks
> and avoid needing to store 2x disk space for the temporary space. I
> wonder, if we're no longer concerned about keeping the number of tapes
> down, whether it makes sense to give up on this goal too and just write
> out a separate file for each tape, letting the filesystem avoid
> fragmentation. I suspect it would also be better for filesystems like
> ZFS, and for SSDs, where rewriting blocks can be expensive.

During my testing I actually ran into disk-space problems, where the
index I was building and the temp files used to sort it could not both
fit on disk at the same time, and I was wondering if there wasn't a way
to free up some of those temp files as the index grew. So I don't think
we want to throw caution to the wind here.

(Also, I think logtape.c does make *some* attempt to reduce
fragmentation, but it could probably do more.)
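
For context, the space-reuse mechanism being discussed boils down to a
free list of block numbers: a block released by the read side of the
merge is handed straight back to the next write, so temp space stays
close to 1x the data size, at the cost of any one tape's blocks ending
up scattered around the file. A minimal sketch of that bookkeeping
(hypothetical names, not the actual logtape.c code):

    /* Minimal sketch of shared-block recycling, with hypothetical names;
     * the real logtape.c is considerably more involved.  Blocks freed by
     * readers are reused for the next write, so the sort never needs ~2x
     * the data size in temp space -- but a tape's blocks become
     * physically scattered over time.
     */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct BlockPool
    {
        long    nextBlock;      /* next never-before-used block number */
        long   *freeList;       /* stack of recycled block numbers */
        int     nFree;
        int     freeListSize;
    } BlockPool;

    /* Hand out a block: prefer a recycled one, else extend the temp file. */
    static long
    pool_get_block(BlockPool *pool)
    {
        if (pool->nFree > 0)
            return pool->freeList[--pool->nFree];   /* reuse => fragmentation */
        return pool->nextBlock++;                   /* append => sequential */
    }

    /* Return a block to the pool once its contents have been merged. */
    static void
    pool_release_block(BlockPool *pool, long blknum)
    {
        if (pool->nFree == pool->freeListSize)
        {
            pool->freeListSize = pool->freeListSize ? pool->freeListSize * 2 : 64;
            pool->freeList = realloc(pool->freeList,
                                     pool->freeListSize * sizeof(long));
        }
        pool->freeList[pool->nFree++] = blknum;
    }

    int
    main(void)
    {
        BlockPool pool = {0, NULL, 0, 0};

        long    a = pool_get_block(&pool);  /* block 0: fresh */
        long    b = pool_get_block(&pool);  /* block 1: fresh */

        pool_release_block(&pool, a);       /* reader is done with block 0 */
        printf("next write goes to block %ld (recycled, not %ld)\n",
               pool_get_block(&pool), pool.nextBlock);
        (void) b;
        free(pool.freeList);
        return 0;
    }

Writing a separate file per tape, as suggested above, gives that reuse
up in exchange for letting the filesystem lay each tape out
contiguously.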

>
>
>> With the patched code, the average length of reads on files in
>> pgsql_tmp between lseeks or changing to a different file descriptor is
>> 8, while in the unpatched code it is 14.
>
> I don't think Peter did anything to the scheduling of the merges so I
> don't see how this would be different. It might just have hit a
> preexisting case by changing the number and size of tapes.

Correct. (There was a small additional increase with the memory pool,
but it was small enough that I am not worried about it.) But this
change in the number and size of tapes was exactly what Robert was
worried about, so I don't want to just dismiss it without further
investigation.
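
(For the curious, the statistic above is just "consecutive read()s on
the same fd before an lseek() or a switch to another fd", which can be
pulled out of a syscall trace of the backend with something like the
following -- a sketch against strace's default line format, not the
exact tooling used:)

    /* Sketch: compute the average number of consecutive read() calls on
     * the same file descriptor between lseek()s or switches to a
     * different descriptor, from "strace -e trace=read,lseek" output on
     * stdin.  Assumes strace's default line format; a real measurement
     * would also filter to just the pgsql_tmp file descriptors.
     */
    #include <stdio.h>

    int
    main(void)
    {
        char    line[4096];
        int     cur_fd = -1;
        long    run = 0;            /* reads in the current sequential run */
        long    total_reads = 0;
        long    runs = 0;

        while (fgets(line, sizeof(line), stdin))
        {
            int     fd;

            if (sscanf(line, "read(%d,", &fd) == 1)
            {
                if (fd != cur_fd)   /* switching descriptors ends the run */
                {
                    if (run > 0)
                        runs++;
                    cur_fd = fd;
                    run = 0;
                }
                run++;
                total_reads++;
            }
            else if (sscanf(line, "lseek(%d,", &fd) == 1)
            {
                if (run > 0)        /* an explicit seek ends the run too */
                    runs++;
                run = 0;
            }
        }
        if (run > 0)
            runs++;
        if (runs > 0)
            printf("average reads per sequential run: %.1f\n",
                   (double) total_reads / runs);
        return 0;
    }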

>
> I also don't think the tapes really ought to be so unbalanced. I've
> noticed some odd things myself -- like what does a 1-way merge mean
> here?

I noticed some of those (although in my case it was always the first
merges that were one-way), and I just attributed it to the fact that
the algorithm doesn't know up front how many runs there will be, and so
can't distribute them optimally among the tapes.

But it does occur to me that we are taking the tape analogy rather too
far in that case. We could say that we have only 223 tape *drives*,
but that each run is a separate tape which can be remounted amongst
the drives in any combination, as long as only 223 are active at one
time.

I started looking into this at one time, before I got sidetracked by
the fact that the memory usage pattern would often leave just a few
bytes short of half of work_mem completely unused. Once that memory
usage got fixed, I never returned to the original examination. And it
would be a shame to sink more time into it now, when we are trying to
avoid these polyphase merges altogether.
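
(For reference, those 1-way merges fall straight out of the dummy-run
bookkeeping: the polyphase distribution targets a Fibonacci-like number
of runs per tape, and whatever real runs never materialize are carried
as dummy runs, so a merge step that happens to pull mostly dummies
degenerates into a 1-way step. A toy sketch of the distribution --
simplified, not the actual selectnewtape() logic in tuplesort.c:)

    /* Toy sketch of polyphase run distribution (after Knuth Alg. 5.4.2D),
     * simplified -- not the actual logic in tuplesort.c.  With NTAPES
     * input tapes the ideal per-tape run counts follow a generalized
     * Fibonacci sequence; real runs that never materialize are carried as
     * dummy runs, and a merge step that draws mostly dummy runs shows up
     * in trace_sort as a 1-way (or smaller) merge.
     */
    #include <stdio.h>

    #define NTAPES 4            /* input tapes; the merge target is separate */

    int
    main(void)
    {
        int     target[NTAPES]; /* ideal runs per tape ("A" in Knuth) */
        int     dummy[NTAPES];  /* dummy runs still owed ("D") */
        int     nruns = 23;     /* pretend quicksort produced 23 real runs */
        int     level = 1;

        for (int i = 0; i < NTAPES; i++)
            target[i] = dummy[i] = 1;   /* level 1: one run per tape */

        for (int r = 0; r < nruns; r++)
        {
            int     t = 0;

            /* find a tape that still owes a run at the current level */
            while (t < NTAPES && dummy[t] == 0)
                t++;
            if (t == NTAPES)
            {
                /* all slots filled: move up to the next Fibonacci level */
                int     a = target[0];

                level++;
                for (int i = 0; i < NTAPES; i++)
                {
                    int     newtarget = a + (i + 1 < NTAPES ? target[i + 1] : 0);

                    dummy[i] = newtarget - target[i];
                    target[i] = newtarget;
                }
                t = 0;
            }
            dummy[t]--;         /* one more real run lands on tape t */
        }

        printf("level %d:", level);
        for (int i = 0; i < NTAPES; i++)
            printf("  tape%d = %d real + %d dummy",
                   i, target[i] - dummy[i], dummy[i]);
        printf("\n");
        return 0;
    }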

So, is a sometimes-regression at 64MB really a blocker to a substantial
improvement most of the time at 64MB, and an even larger improvement at
more realistic modern settings for building large indexes?

> Fwiw attached are two patches for perusal. One is a trivial patch to
> add the size of the tape to trace_sort output. I guess I'll just apply
> that without discussion.

+1 there. Having this in place would make evaluating the other changes
easier.

Cheers,

Jeff

On Tue, Dec 8, 2015 at 6:39 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Wed, Dec 9, 2015 at 12:02 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>>
>> Then in the next (final) merge, it has to read in this huge
>> fragmented tape run emulation, generating a lot of random IO to read
>> it.
>
> This seems fairly plausible. Logtape.c is basically implementing a
> small filesystem and doesn't really make any attempt to avoid
> fragmentation. The reason it does this is so that we can reuse blocks
> and avoid needing to store 2x disk space for the temporary space. I
> wonder, if we're no longer concerned about keeping the number of tapes
> down, whether it makes sense to give up on this goal too and just write
> out a separate file for each tape, letting the filesystem avoid
> fragmentation. I suspect it would also be better for filesystems like
> ZFS, and for SSDs, where rewriting blocks can be expensive.
>
>
>> With the patched code, the average length of reads on files in
>> pgsql_tmp between lseeks or changing to a different file descriptor is
>> 8, while in the unpatched code it is 14.
>
> I don't think Peter did anything to the scheduling of the merges so I
> don't see how this would be different. It might just have hit a
> preexisting case by changing the number and size of tapes.
>
> I also don't think the tapes really ought to be so unbalanced. I've
> noticed some odd things myself -- like what does a 1-way merge mean
> here?
>
> LOG: finished writing run 56 to tape 2 (9101313 blocks): CPU
> 0.19s/10.97u sec elapsed 16.68 sec
> LOG: finished writing run 57 to tape 3 (9084929 blocks): CPU
> 0.19s/11.14u sec elapsed 19.08 sec
> LOG: finished writing run 58 to tape 4 (9101313 blocks): CPU
> 0.20s/11.31u sec elapsed 19.26 sec
> LOG: performsort starting: CPU 0.20s/11.48u sec elapsed 19.44 sec
> LOG: finished writing run 59 to tape 5 (9109505 blocks): CPU
> 0.20s/11.49u sec elapsed 19.44 sec
> LOG: finished writing final run 60 to tape 6 (8151041 blocks): CPU
> 0.20s/11.55u sec elapsed 19.50 sec
> LOG: finished 1-way merge step (1810433 blocks): CPU 0.20s/11.58u sec
> elapsed 19.54 sec <-------------------------=========
> LOG: finished 10-way merge step (19742721 blocks): CPU 0.20s/12.23u
> sec elapsed 20.19 sec
> LOG: finished 13-way merge step (23666689 blocks): CPU 0.20s/13.15u
> sec elapsed 21.11 sec
> LOG: finished 13-way merge step (47333377 blocks): CPU 0.22s/14.07u
> sec elapsed 23.13 sec
> LOG: finished 14-way merge step (47333377 blocks): CPU 0.24s/15.65u
> sec elapsed 24.74 sec
> LOG: performsort done (except 14-way final merge): CPU 0.24s/15.66u
> sec elapsed 24.75 sec
>
> I wonder if something's wrong with the merge scheduling.
>
> Fwiw attached are two patches for perusal. One is a trivial patch to
> add the size of the tape to trace_sort output. I guess I'll just apply
> that without discussion. The other replaces the selection sort with an
> open-coded sorting network for cases of up to 8 elements (only in the
> Perl-generated qsort for the moment). I don't have the bandwidth to
> benchmark this at the moment, but if anyone's interested in trying it,
> I suspect it'll make a small but noticeable difference. I'm guessing
> 2-5%.
>
> --
> greg
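
(For anyone unfamiliar with the second patch's technique: an open-coded
sorting network is a fixed sequence of compare/exchange steps for a
known small element count, with no data-dependent loop structure. An
illustrative 4-element network on ints -- not Greg's patch, which
targets the generated qsort's small-array case:)

    /* Illustrative 4-element sorting network (the optimal 5-comparator
     * one); not Greg's patch.  The comparison sequence is fixed
     * regardless of the input, which is what makes it attractive versus
     * a selection sort for tiny arrays.
     */
    #include <stdio.h>

    #define CMP_SWAP(a, i, j) \
        do { \
            if ((a)[i] > (a)[j]) \
            { \
                int _t = (a)[i]; (a)[i] = (a)[j]; (a)[j] = _t; \
            } \
        } while (0)

    static void
    sort4(int *a)
    {
        CMP_SWAP(a, 0, 1);
        CMP_SWAP(a, 2, 3);
        CMP_SWAP(a, 0, 2);
        CMP_SWAP(a, 1, 3);
        CMP_SWAP(a, 1, 2);
    }

    int
    main(void)
    {
        int     v[4] = {42, 7, 19, 3};

        sort4(v);
        printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);   /* 3 7 19 42 */
        return 0;
    }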
