Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Using quicksort for every external sort run
Date: 2015-11-29 00:05:17
Message-ID: CAM3SWZTYneCG1oZiPwRU=J6ks+VpRxt2Da1ZMmqFBrd5jaSJSA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 28, 2015 at 2:04 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> For me very large sorts (100,000,000 ints) with work_mem below 4MB do
> better with unpatched than with your patch series, by about 5%. Not a
> big deal, but also if it is easy to keep the old behavior then I think
> we should. Yes, it is dumb to do large sorts with work_mem below 4MB,
> but if you have canned apps which do a mixture of workloads it is not
> so easy to micromanage their work_mem. Especially as there are no
> easy tools that let me as the DBA say "if you connect from this IP
> address, you get this work_mem".

I'm not very concerned about a regression that is only seen when
work_mem is set below the (very conservative) postgresql.conf default
value of 4MB when sorting 100 million integers. Thank you for
characterizing the regression, though -- it's good to have a better
idea of how much of a problem that is in practice.

I can still preserve the old behavior with a GUC, but it isn't
completely trivial, and I don't want to complicate things any further
without a real benefit, which I still don't see. I'm still using a
replacement selection style heap, and I think that there will be
future uses for the heap (e.g. dynamic duplicate removal within
tuplesort), though.

>> Other systems expose this explicitly, and, as I said, say in an
>> unqualified way that a multi-pass merge should be avoided. Maybe the
>> warning isn't the right way of communicating that message to the DBA
>> in detail, but I am confident that it ought to be communicated to the
>> DBA fairly clearly.
>
> I was thinking about how many other places in the code could justify a
> similar type of warning "If you just gave me 15% more memory, this
> hash join would be much faster", and what that would make the logs
> look like if future work went along with this precedent. If there
> were some mechanism to put the warning in a system view counter
> instead of the log file, that would be much cleaner. Or a way to
> separate the server log file into streams. But since we don't have
> those, I guess I can't really object much to the proposed behavior.

I'm going to let this go, actually. Not because I don't think that
avoiding a multi-pass sort is a good goal for DBAs to have, but
because a multi-pass sort doesn't appear to be a point at which
performance tanks these days, with modern block devices. Also, I just
don't have time to push something non-essential that there is
resistance to.

>>> One idea would be to stop and write out a just-sorted partition
>>> whenever that partition is contiguous to the already-written portion.
>>> If the qsort is tweaked to recurse preferentially into the left
>>> partition first, this would result in tuples being written out at a
> pretty steady pace. If the qsort was unbalanced and the left partition
>>> was always the larger of the two, then that approach would have to be
>>> abandoned at some point. But I think there are already defenses
>>> against that, and at worst you would give up and revert to the
>>> sort-them-all then write-them-all behavior.
>>
>> Seems kind of invasive.
>
> I agree, but I wonder if it won't become much more important at 30GB
> of work_mem. Of course if there is no reason to ever set work_mem
> that high, then it wouldn't matter--but there is always a reason to do
> so, if you have so much memory to spare. So better than that invasive
> work, I guess would be to make sort use less than work_mem if it gets
> no benefit from using all of it. Anyway, ideas for future work,
> either way.
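
For the record, my reading of that incremental write-out idea is
roughly the toy sketch below. None of this is tuplesort code --
emit_range(), partition(), and the global "written" counter are just
stand-ins -- and a real implementation would presumably batch the
writes rather than flush at every pivot:

#include <stdio.h>

static int written = 0;     /* number of elements already written out */

/* stand-in for dumping tuples [lo..hi] to the current run/tape */
static void
emit_range(int *a, int lo, int hi)
{
    for (int i = lo; i <= hi; i++)
        printf("%d\n", a[i]);
}

static int
partition(int *a, int lo, int hi)
{
    int     pivot = a[hi];
    int     i = lo;

    for (int j = lo; j < hi; j++)
    {
        if (a[j] < pivot)
        {
            int     tmp = a[i];

            a[i] = a[j];
            a[j] = tmp;
            i++;
        }
    }
    a[hi] = a[i];
    a[i] = pivot;
    return i;
}

static void
quicksort_incremental(int *a, int lo, int hi)
{
    int     p;

    if (lo >= hi)
    {
        /* zero or one element; a lone element is in its final place */
        if (lo == hi && written <= lo)
        {
            emit_range(a, written, lo);
            written = lo + 1;
        }
        return;
    }

    p = partition(a, lo, hi);

    /* recurse into the left partition first, so output starts early */
    quicksort_incremental(a, lo, p - 1);

    /*
     * Everything up to and including the pivot is now in its final
     * position and contiguous with the already-written portion, so
     * it can be written out.
     */
    if (written <= p)
    {
        emit_range(a, written, p);
        written = p + 1;
    }

    quicksort_incremental(a, p + 1, hi);
}

int
main(void)
{
    int     a[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};

    quicksort_incremental(a, 0, 9);
    return 0;
}

Doing something equivalent inside tuplesort.c would mean interleaving
the copy, sort, and write phases of run building, which is the main
reason it strikes me as invasive.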

I hope to come up with a fairly robust model for automatically sizing
an "effective work_mem" in the context of external sorts. There should
be a heuristic that balances fan-in against other considerations. I
think that doing this with the existing external sort code would be
completely hopeless. This is a problem that is well understood by the
research community, although balancing things well in the context of
PostgreSQL is a little trickier.
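
To give a flavor of what I mean, the textbook arithmetic looks like
the sketch below. This is only an illustration of the kind of model I
have in mind, not code from the patch series; TAPE_BUFFER is an
assumed per-merge-input buffer size, and the real memory accounting in
tuplesort.c is more involved than this:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* assumed memory reserved per merge input; not a real GUC or constant */
#define TAPE_BUFFER     (256 * 1024)

/*
 * Smallest memory budget that still allows a one-pass merge: with
 * quicksorted runs of roughly M bytes each there are about input/M
 * runs, and a single merge pass can combine about M/TAPE_BUFFER of
 * them, so one pass needs M/TAPE_BUFFER >= input/M, i.e.
 * M >= sqrt(input * TAPE_BUFFER).
 */
static int64_t
min_memory_for_one_pass(int64_t input_bytes)
{
    return (int64_t) ceil(sqrt((double) input_bytes * TAPE_BUFFER));
}

int
main(void)
{
    int64_t     input = INT64_C(100) * 1024 * 1024 * 1024;     /* 100GB */

    printf("one-pass merge needs >= %.0f MB\n",
           min_memory_for_one_pass(input) / (1024.0 * 1024.0));
    return 0;
}

The memory needed to avoid a multi-pass merge only grows as the square
root of the input size (about 160MB for 100GB of input under those
assumptions), which is why an "effective work_mem" can often be much
smaller than whatever the DBA happened to set.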

I also think it's a little arbitrary that the final on-the-fly merge
step uses a work_mem-ish sized buffer, much like the sorting of runs,
as if there is a good reason to be consistent. Maybe that's fine,
though.

There are advantages to returning tuples earlier in the context of
parallelism, which recommends smaller effective work_mem sizes
(provided they're above a certain threshold). For this reason, having
larger runs may not be a useful goal in general, even without
considering the cost in cache misses paid in pursuit of that goal.

>> Thanks, but I expected better than that. Was it a collated text
>> column? The C collation will put the patch in a much better light
>> (more strcoll() calls are needed with this new approach -- it's still
>> well worth it, but it is a downside that makes collated text not
>> especially sympathetic). Just sorting on an integer attribute is also
>> a good sympathetic case, FWIW.
>
> It was UTF8 encoded (although all characters were actually ASCII), but
> C collated.

I should have considered, when reacting here, that you'd hand-optimized
the work_mem setting for each case -- I was at a conference when I
responded. You can show the existing code in a
better light by doing that, as you have, but I think it's all but
irrelevant. It isn't even practical for experts to do that, so the
fact that it is possible is only really a footnote. My choice of
work_mem for my tests tended to be round numbers, like 1GB, because
that was the first thing I thought of.

> I've never seen improvements of 3 fold or more like you saw, under any
> conditions, so I wonder if your test machine doesn't have unusually
> slow main memory.

I think that there is a far simpler explanation. Any time I reported a
figure over ~2.5x, it was for "quicksort with spillover", and with a
temp tablespace on tmpfs to simulate lots of I/O bandwidth (but with
hardly any actual writing to tape -- that's the whole point of that
case). I also think that the heap structure does very badly with low
cardinality sets, which is where the 3.25X - 4X numbers came from. You
haven't tested "quicksort with spillover" here at all, which is fine,
since it is less important. Finally, as I said, I did not give the
master branch the benefit of fine-tuning work_mem (which I think is
fair and representative).

> My largest test, which took my true table and extrapolated it out for
> a few years growth, had about 500,000,000 rows.

Cool.

> At 3GB maintenance_work_mem, it took 13 runs patched and 7 runs
> unpatched to build the index, with timings of 3168.66 sec and 5713.07
> sec.
>
> The final merging is intermixed with whatever other work goes on to
> build the actual index files out of the sorted data, so I don't know
> exactly what the timing of just the merge part was. But it was
> certainly a minority of the time, even if you assume the actual index
> build were free. For the patched code, the majority of the time goes
> to the quick sorting stages.

I'm not sure what you mean here. I agree that the work of (say)
inserting leaf tuples as part of an index build is kind of the same
cost as the merge step itself, or doesn't vary markedly between the
CREATE INDEX case and other cases (where there is some analogous
processing of final sorted output).

I would generally expect the merge phase to take significantly less
time than sorting the runs, regardless of how we sort them, unless
parallelism is involved, in which case merging could dominate. The
master branch has a
faster merge step, at least proportionally, because it has larger
runs.

> When I test each version of the code at its own most efficient
> maintenance_work_mem, I get
> 3007.2 seconds at 1GB for patched and 3836.46 seconds at 64MB for unpatched.

As I said, it seems a little bit unfair to hand-tune work_mem or
maintenance_work_mem like that. Who can afford to do that? I think you
agree that it's untenable to have DBAs allocate work_mem differently
for cases where an internal sort or external sort is expected;
workloads are just far too complicated and changeable.

> I'm attaching the trace_sort output from the client log for all 4 of
> those scenarios. "sort_0005" means all 5 of your patches were
> applied, "origin" means none of them were.

Thanks for looking at this. This is very helpful. It looks like the
server you used here had fairly decent disks, and that we tended to be
CPU bound more often than not. That's a useful testing ground.

Consider run #7 (of 13 total) with 3GB maintenance_work_mem, for
example (this run was picked at random):

...
LOG: finished writing run 6 to tape 5: CPU 35.13s/1028.44u sec elapsed 1080.43 sec
LOG: starting quicksort of run 7: CPU 38.15s/1051.68u sec elapsed 1108.19 sec
LOG: finished quicksorting run 7: CPU 38.16s/1228.09u sec elapsed 1284.87 sec
LOG: finished writing run 7 to tape 6: CPU 40.21s/1235.36u sec elapsed 1295.19 sec
LOG: starting quicksort of run 8: CPU 42.73s/1257.59u sec elapsed 1321.09 sec
...

So there was 27.76 seconds spent copying tuples into local memory
ahead of the quicksort, 2 minutes 56.68 seconds spent actually
quicksorting, and a trifling 10.32 seconds actually writing the run! I
bet that the quicksort really didn't use up too much memory bandwidth
on the system as a whole, since abbreviated keys are used with a cache
oblivious internal sorting algorithm.
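
(Spelling that out, the figures are just differences between the
elapsed timestamps in the excerpt:

1108.19 - 1080.43 =  27.76 s   copying tuples into memory
1284.87 - 1108.19 = 176.68 s   quicksorting run 7
1295.19 - 1284.87 =  10.32 s   writing run 7 to tape)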

This suggests that this case would benefit rather a lot from parallel
workers doing this for each run at the same time (once my code is
adapted to do that, of course). This is something I'm currently
researching. I think that (roughly speaking) each core on this system
is likely slower than the cores on a 4-core consumer desktop/laptop,
which is very normal, particularly with x86_64 systems. That also
makes it more representative than my previous tests.

--
Peter Geoghegan
