Re: Using quicksort for every external sort run

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-11-28 22:04:16
Message-ID: CAMkU=1xvAGBJRvhTr8kn80wu=YaTxL0=R3wrjts0EM-TuqTvBw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> I agree we don't want to optimize for low memory, but I don't think we
>> should throw it under the bus, either. Right now we are effectively
>> saying the CPU-cache problems with the heap start exceeding the larger
>> run size benefits at 64kB (the smallest allowed setting for work_mem).
>> While any number we pick is going to be a guess that won't apply to
>> all hardware, surely we can come up with a guess better than 64kB.
>> Like 8MB, say. If available memory for the sort is 8MB or smaller
>> and the predicted size anticipates a multipass merge, then we can use
>> the heap method rather than the quicksort method. Would a rule like
>> that complicate things much?
>
> I'm already using replacement selection for the first run when it is
> predicted by my new ad-hoc cost model that we can get away with a
> "quicksort with spillover", avoiding almost all I/O. We only
> incrementally spill as many tuples as needed right now, but it would
> be pretty easy to not quicksort the remaining tuples, but continue to
> incrementally spill everything. So no, it wouldn't be too hard to hang
> on to the old behavior sometimes, if it looked worthwhile.
>
> In principle, I have no problem with doing that. Through testing, I
> cannot see any actual upside, though. Perhaps I just missed something.
> Even 8MB is enough to avoid the multipass merge in the event of a
> surprisingly high volume of data (my work laptop is elsewhere, so I
> don't have my notes on this in front of me, but I figured out the
> crossover point for a couple of cases).

For me, very large sorts (100,000,000 ints) with work_mem below 4MB do
about 5% better unpatched than with your patch series. Not a big deal,
but if it is easy to keep the old behavior then I think we should.
Yes, it is dumb to do large sorts with work_mem below 4MB, but if you
have canned apps which run a mixture of workloads it is not so easy to
micromanage their work_mem, especially as there are no easy tools that
let me as the DBA say "if you connect from this IP address, you get
this work_mem".
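
(The closest workaround I know of is per-role or per-database settings,
or having the application override the session value itself after
connecting. A minimal sketch, assuming psycopg2 and a hypothetical
connection string:

    import psycopg2

    # There is no server-side per-client-address work_mem setting, so
    # the application (or a pooler) can override the session value
    # itself right after connecting.
    conn = psycopg2.connect("dbname=app user=report")  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute("SET work_mem = '8MB'")  # session-level override

ALTER ROLE ... SET work_mem works too, but neither is keyed off the
client address.)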

I didn't collect trace_sort output for those runs because of the high
volume it would generate.
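
(For anyone following along without the patches handy: the "heap
method" being discussed is classic replacement selection. A toy sketch
of the run-formation logic, in Python rather than tuplesort.c's C, and
ignoring all the real memory accounting:

    import heapq
    import itertools

    def replacement_selection(tuples, memory_slots):
        # Keep a heap of memory_slots values tagged with a run number.
        # Pop the smallest; the next input value joins the current run
        # if it sorts after the popped value, otherwise it must wait
        # for the next run.
        it = iter(tuples)
        heap = [(0, v) for v in itertools.islice(it, memory_slots)]
        heapq.heapify(heap)
        runs, current, run_no = [], [], 0
        while heap:
            r, v = heapq.heappop(heap)
            if r != run_no:
                runs.append(current)
                current, run_no = [], r
            current.append(v)          # "spill" v to the current run
            nxt = next(it, None)
            if nxt is not None:
                heapq.heappush(heap, (r if nxt >= v else r + 1, nxt))
        if current:
            runs.append(current)
        return runs

On random input this produces runs averaging about twice the memory
size, which is the classic argument for the heap; the question in this
thread is whether that still pays for the heap's poor CPU-cache
behavior at realistic work_mem sizes.)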

>
>>> In theory, the answer could be "yes", but it seems highly unlikely.
>>> Not only is very little memory required to avoid a multi-pass merge
>>> step, but as described above the amount required grows very slowly
>>> relative to linear growth in input. I propose to add a
>>> checkpoint_warning style warning (with a checkpoint_warning style GUC
>>> to control it).
>>
>> I'm skeptical about a warning for this.
>
> Other systems expose this explicitly, and, as I said, say in an
> unqualified way that a multi-pass merge should be avoided. Maybe the
> warning isn't the right way of communicating that message to the DBA
> in detail, but I am confident that it ought to be communicated to the
> DBA fairly clearly.

I've been thinking about how many other places in the code could
justify a similar type of warning ("if you just gave me 15% more
memory, this hash join would be much faster"), and what the logs would
look like if future work followed that precedent. If there were some
mechanism to put the warning in a system view counter instead of the
log file, that would be much cleaner. Or a way to separate the server
log file into streams. But since we don't have those, I guess I can't
really object much to the proposed behavior.
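
(To make the "grows very slowly" claim above concrete: by the usual
external-merge arithmetic, a one-pass merge of D bytes of run data with
work_mem M has roughly D/M runs to merge at once, each needing a buffer
of B bytes, so a single pass requires M >= (D/M) * B, i.e.
M >= sqrt(D * B). A back-of-envelope sketch, with B = 256kB picked
purely for illustration, not taken from the actual tape code:

    import math

    def min_onepass_work_mem(input_bytes, buffer_bytes=256 * 1024):
        # One-pass merge needs one buffer per run, and the number of
        # runs is roughly input_bytes / work_mem, so the threshold is
        # the square root of (input size * per-run buffer size).
        return math.sqrt(input_bytes * buffer_bytes)

    for gb in (1, 10, 100):
        m = min_onepass_work_mem(gb * 2**30)
        print(f"{gb:>3} GB input -> ~{m / 2**20:.0f} MB work_mem")

    #   1 GB input -> ~16 MB work_mem
    #  10 GB input -> ~51 MB work_mem
    # 100 GB input -> ~160 MB work_mem

So the memory needed to avoid a multipass merge grows only with the
square root of the input, which is why such a warning would fire so
rarely.)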

>
>> One idea would be to stop and write out a just-sorted partition
>> whenever that partition is contiguous to the already-written portion.
>> If the qsort is tweaked to recurse preferentially into the left
>> partition first, this would result in tuples being written out at a
>> pretty steady pace. If the qsort was unbalanced and the left partition
>> was always the larger of the two, then that approach would have to be
>> abandoned at some point. But I think there are already defenses
>> against that, and at worst you would give up and revert to the
>> sort-them-all then write-them-all behavior.
>
> Seems kind of invasive.

I agree, but I wonder if it won't become much more important at 30GB
of work_mem. Of course, if there is no reason to ever set work_mem
that high then it wouldn't matter, but there is always a reason to do
so if you have that much memory to spare. So rather than that invasive
work, I guess it would be better to make the sort use less than
work_mem when it gets no benefit from using all of it. Anyway, ideas
for future work either way.
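
(A rough sketch of the incremental-write idea quoted above, just to
pin down the invariant; toy Python, not tuplesort.c. Because each
quicksort partition ends up in its final global position range,
recursing left-first means everything below the current boundary is
already final, so the sorted prefix can be flushed as it becomes
contiguous:

    def partition(a, lo, hi):
        # Lomuto partition; returns the pivot's final index.
        pivot, i = a[hi - 1], lo
        for j in range(lo, hi - 1):
            if a[j] <= pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[i], a[hi - 1] = a[hi - 1], a[i]
        return i

    def qsort_spill(a, lo, hi, emit, state):
        # state["flushed"]: one past the last element already written.
        if hi - lo < 2:
            flush(a, hi, emit, state)
            return
        p = partition(a, lo, hi)
        qsort_spill(a, lo, p, emit, state)   # left side first...
        flush(a, p + 1, emit, state)         # ...pivot is now final too
        qsort_spill(a, p + 1, hi, emit, state)

    def flush(a, upto, emit, state):
        # Left-first recursion guarantees a[:upto] is in final order.
        if upto > state["flushed"]:
            emit(a[state["flushed"]:upto])
            state["flushed"] = upto

    a, out, state = [5, 3, 8, 1, 9, 2], [], {"flushed": 0}
    qsort_spill(a, 0, len(a), out.extend, state)
    assert out == [1, 2, 3, 5, 8, 9]

The catch is exactly the one quoted above: production quicksorts
recurse into the smaller partition first to bound stack depth, and
this scheme wants the opposite.)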

>
>> Overall this is very nice. Doing some real world index builds of
>> short text (~20 bytes ascii) identifiers, I could easily get speed ups
>> of 40% with your patch if I followed the philosophy of "give it as
>> much maintenance_work_mem as I can afford". If I fine-tuned the
>> maintenance_work_mem so that it was optimal for each sort method, then
>> the speed-up was quite a bit less, only 22%. But 22% is still very
>> worthwhile, and who wants to spend their time fine-tuning the memory
>> use for every index build?
>
> Thanks, but I expected better than that. Was it a collated text
> column? The C collation will put the patch in a much better light
> (more strcoll() calls are needed with this new approach -- it's still
> well worth it, but it is a downside that makes collated text not
> especially sympathetic). Just sorting on an integer attribute is also
> a good sympathetic case, FWIW.

It was UTF8 encoded (although all characters were actually ASCII), but
C collated.

I've never seen improvements of 3-fold or more like you saw, under any
conditions, so I wonder whether your test machine has unusually slow
main memory.

>
> How much time did the sort take in each case? How many runs? How much
> time was spent merging? trace_sort output is very interesting here.

My largest test, which took my real table and extrapolated it out for
a few years' growth, had about 500,000,000 rows.

At 3GB maintenance_work_mem, it took 13 runs patched and 7 runs
unpatched to build the index, with timings of 3168.66 sec and 5713.07
sec, respectively.

The final merge is intermixed with whatever other work goes into
building the actual index files out of the sorted data, so I don't
know exactly how long the merge step alone took. But it was certainly
a minority of the time, even if you assume the actual index build
itself was free. For the patched code, the majority of the time goes
to the quicksort stages.

When I test each version of the code at its own most efficient
maintenance_work_mem, I get
3007.2 seconds at 1GB for patched and 3836.46 seconds at 64MB for unpatched.

I'm attaching the trace_sort output from the client log for all four
of those scenarios. "sort_0005" means all five of your patches were
applied; "origin" means none of them were.

Cheers,

Jeff

Attachment Content-Type Size
trace_sort_1GB_sort005.txt text/plain 9.8 KB
trace_sort_64MB_origin.txt text/plain 25.0 KB
trace_sort_3GB_sort_005.txt text/plain 3.9 KB
trace_sort_3GB_origin.txt text/plain 1.3 KB
