Re: Polyphase merge is obsolete

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Polyphase merge is obsolete
Date: 2017-01-17 01:56:10
Message-ID: CAM3SWZRLXEkuNX=mF_Grtc0CJ1TwOSAsrJd8Hnpwx+kz4A=DLw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 12, 2016 at 10:16 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> The number of *input* tapes we can use in each merge pass is still limited,
> by the memory needed for the tape buffers and the merge heap, but only one
> output tape is active at any time. The inactive output tapes consume very
> few resources. So the whole idea of trying to efficiently reuse input tapes
> as output tapes is pointless

I picked this up again. The patch won't apply cleanly. Can you rebase?
Also, please look at my bugfix for logtape.c free block management [1]
before doing so, as that might be prerequisite. Finally, I don't think
that the Logical tape pause/resume idea is compelling. Is it hard to
not do that, but still do everything else that you propose here?
That's what I lean towards doing right now.

Anyway, efficient use of tapes certainly mattered a lot more when
rewinding meant sitting around for a magnetic tape deck to physically
rewind. There is another algorithm in AoCP Vol III that lets us write
to tapes backwards, actually, which is motivated by similar obsolete
considerations about hardware. Why not write while we rewind, to avoid
doing nothing else while rewinding?!

Perhaps this patch should make a clean break from the "rewinding"
terminology. Perhaps you should rename LogicalTapeRewindForRead() to
LogicalTapePrepareForRead(), and so on. It's already a bit awkward
that that routine is called LogicalTapeRewindForRead(), because it
behaves significantly differently when a tape is frozen, and because
the whole point of logtape.c is space reuse that is completely
dissimilar to rewinding. (Space reuse is thus totally unlike how
polyphase merge is supposed to reuse space, which is all about
rewinding, and isn't nearly as eager. Same applies to K-way balanced
merge, of course.)

I think that the "rewinding" terminology does more harm than good, now
that it doesn't even help the Knuth reader to match Knuth's remarks to
what's going on in tuplesort.c. Just a thought.

> Let's switch over to a simple k-way balanced merge. Because it's simpler. If
> you're severely limited on memory, like when sorting 1GB of data with
> work_mem='1MB' or less, it's also slightly faster. I'm not too excited about
> the performance aspect, because in the typical case of a single-pass merge,
> there's no difference. But it would be worth changing on simplicity grounds,
> since we're mucking around in tuplesort.c anyway.

I actually think that the discontinuities in the merge scheduling are
worse than you suggest here. There doesn't have to be as extreme a
difference between work_mem and the size of input as you describe
here. As an example:

create table seq_tab(t int);
insert into seq_tab select generate_series(1, 10000000);
set work_mem = '4MB';
select count(distinct t) from seq_tab;

The trace_sort output ends like this:

30119/2017-01-16 17:17:05 PST LOG: begin datum sort: workMem = 4096,
randomAccess = f
30119/2017-01-16 17:17:05 PST LOG: switching to external sort with 16
tapes: CPU: user: 0.07 s, system: 0.00 s, elapsed: 0.06 s
30119/2017-01-16 17:17:05 PST LOG: starting quicksort of run 1: CPU:
user: 0.07 s, system: 0.00 s, elapsed: 0.06 s
30119/2017-01-16 17:17:05 PST LOG: finished quicksort of run 1: CPU:
user: 0.07 s, system: 0.00 s, elapsed: 0.07 s
*** SNIP ***
30119/2017-01-16 17:17:08 PST LOG: finished writing run 58 to tape 0:
CPU: user: 2.50 s, system: 0.27 s, elapsed: 2.78 s
30119/2017-01-16 17:17:08 PST LOG: using 4095 KB of memory for read
buffers among 15 input tapes
30119/2017-01-16 17:17:08 PST LOG: finished 1-way merge step: CPU:
user: 2.52 s, system: 0.28 s, elapsed: 2.80 s
30119/2017-01-16 17:17:08 PST LOG: finished 4-way merge step: CPU:
user: 2.58 s, system: 0.30 s, elapsed: 2.89 s
30119/2017-01-16 17:17:08 PST LOG: finished 14-way merge step: CPU:
user: 2.86 s, system: 0.34 s, elapsed: 3.20 s
30119/2017-01-16 17:17:08 PST LOG: finished 14-way merge step: CPU:
user: 3.09 s, system: 0.41 s, elapsed: 3.51 s
30119/2017-01-16 17:17:09 PST LOG: finished 15-way merge step: CPU:
user: 3.61 s, system: 0.52 s, elapsed: 4.14 s
30119/2017-01-16 17:17:09 PST LOG: performsort done (except 15-way
final merge): CPU: user: 3.61 s, system: 0.52 s, elapsed: 4.14 s
30119/2017-01-16 17:17:10 PST LOG: external sort ended, 14678 disk
blocks used: CPU: user: 4.93 s, system: 0.57 s, elapsed: 5.51 s

(This is the test case that Cy posted earlier today, for the bug that
was just fixed in master.)

The first 1-way merge step is clearly kind of a waste of time. We
incur no actual comparisons during this "merge", since there is only
one real run from each input tape (all other active tapes contain only
dummy runs). We are, in effect, just shoveling the tuples from that
single run from one tape to another (from one range in the underlying
logtape.c BufFile space to another). I've seen this quite a lot
before, over the years, while working on sorting. It's not that big of
a deal, but it's a degenerate case that illustrates how polyphase
merge can do badly. What's probably of more concern here is that the
final on-the-fly merge we see merges 15 input runs, but 10 of the
inputs only have runs that are sized according to how much we could
quicksort at once (actually, I should say 11 out of 15, what with that
1-way merge). So, 4 of those runs are far larger, which is not great.
ISTM that having the merge be "balanced" is pretty important, at least
as far as multi-pass sorts in the year 2017 go.

Still, I agree with you that this patch is much more about refactoring
than about performance, and I agree that the refactoring is
worthwhile. I just think that you may have somewhat underestimated how
much this could help when we're low on memory, but not ridiculously
so. It doesn't seem necessary to prove this, though. (We need only
verify that there are no regressions.)

That's all I have right now.

[1] https://commitfest.postgresql.org/13/955/
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-01-17 02:22:45 Re: Patch to implement pg_current_logfile() function
Previous Message Haribabu Kommi 2017-01-17 01:19:40 Re: pg_hba_file_settings view patch