Re: Using quicksort for every external sort run

From: Greg Stark <stark(at)mit(dot)edu>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-12-07 13:17:25
Message-ID: CAM-w4HM4XW3u5kVEuUrr+L+KX3WZ=5JKk0A=DJjzypkB-Hyu4w@mail.gmail.com
Lists: pgsql-hackers

So incidentally I've been running some benchmarks myself, mostly to
understand the current scaling behaviour of sorting and to better judge
whether Peter's analysis of where the pain points are, and why we should not
worry about optimizing for the multiple-merge-pass case, was on target. I
haven't actually benchmarked his patch at all, just stock HEAD so far.

The really surprising result (for me) so far is that merge passes actually
spend very little time doing I/O. I had always assumed most of the time was
spent waiting on I/O, and that's why we spend so much effort ensuring
sequential I/O and trying to maximize run lengths. I was expecting to see a
huge step increase in the total time whenever there was an increase in merge
passes. However, I see hardly any increase, sometimes even a decrease
despite the extra pass. The time generally increases as work_mem decreases,
but the slope is pretty moderate and gradual, with no big steps due to extra
passes.

On further analysis I'm less surprised by this than I was at first. The
larger benchmarks I'm running are on a 7GB table which only actually
generates 2.6GB of sort data, so even writing all of that out and then
reading it all back in on a 100MB/s disk would only take an extra ~50s.
That won't make a big dent when the whole sort takes about 30 minutes. Even
if you assume a substantial amount of random I/O, it'll only be a 10%
difference or so, which is more or less in line with what I'm seeing.
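As a sanity check, here's that back-of-envelope arithmetic as a quick query
(a rough sketch, assuming a flat 100MB/s for both the write and the read of
the extra pass):

SELECT 2672 / 100.0     AS write_secs,      -- writing the run data back out
       2672 / 100.0     AS read_secs,       -- reading it in for the next merge
       2 * 2672 / 100.0 AS extra_pass_secs; -- ~53s against a ~3400s sort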

I haven't actually got to benchmarking Peter's patch at all, but this is
reinforcing his argument dramatically. If the worst case for using quicksort
is that the shorter runs might push us into doing an extra merge, and that
might add an extra 10% to the run-time, that will be easily counter-balanced
by the faster quicksort; and in any case it only affects people who for some
reason can't just increase work_mem to allow the single-merge mode.

                         work_mem
Table Size  Sort Size    128MB     64MB      32MB      16MB      8MB       4MB
6914MB      2672 MB      3392.29   3102.13   3343.53   4081.23   4727.74   5620.77
3457MB      1336 MB      1669.16   1593.85   1444.22   1654.27   2076.74   2266.84
2765MB      1069 MB      1368.92   1250.44   1117.2    1293.45   1431.64   1772.18
1383MB       535 MB       716.48    625.06    557.14    575.67    644.2     721.68
 691MB       267 MB       301.08    295.87    266.84    256.29    283.82    292.24
 346MB       134 MB       145.48    149.48    133.23    130.69    127.67    137.74
  35MB        13 MB         3.58     16.77     11.23     11.93     13.97      3.17

The colours are to give an idea of the number of merge passes: grey is an
internal sort, white is a single merge, and yellow and red are successively
more merges (though the exact boundary between yellow and red may not be
exactly meaningful due to my imperfect understanding of polyphase merge).

The numbers here are the seconds taken from the "elapsed" field of log
statements like the following, produced by running queries like the one
shown with trace_sort enabled:

LOG: external sort ended, 342138 disk blocks used: CPU 276.04s/3173.04u sec elapsed 5620.77 sec
STATEMENT: select count(*) from (select * from n200000000 order by r offset 99999999999) AS x;
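For reference, each cell was gathered roughly like this (the session
settings beyond trace_sort are my guesses rather than something recorded in
the logs):

SET trace_sort = on;            -- emits the "external sort ended" LOG lines
SET client_min_messages = log;  -- make the LOG lines visible in the client
SET work_mem = '32MB';          -- varied per column of the table above
SELECT count(*)
  FROM (SELECT * FROM n200000000 ORDER BY r OFFSET 99999999999) AS x;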

This was run on the smallest size VM on Google Compute Engine with 600MB of
virtual RAM and a 100GB virtual network block device.
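For anyone who wants to try something comparable, a test table of roughly
that shape could be built like this (the schema is only sketched from the
query above, which shows just the name n200000000 and a column r; 200M
random float8 rows should come out somewhere around 7GB on disk):

CREATE TABLE n200000000 AS
  SELECT random() AS r
    FROM generate_series(1, 200000000);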
