Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-09-06 07:34:54
Message-ID: 0939f43f-226b-ee57-4a27-2d1b8e18185c@iki.fi
Lists: pgsql-hackers

I'm reviewing patches 1-3 in this series, i.e. those patches that are
not directly related to parallelism, but are independent improvements to
merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at least 32
> * BLCKSZ buffer space per tape), rejecting Knuth's theoretically
> justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's P),
> improving performance when the sort thereby completed in one pass.
> However, it's still true that there are unlikely to be benefits from
> increasing the number of tapes past 7 once the amount of data to be
> sorted significantly exceeds available memory; that commit probably
> mostly just improved matters where it enabled all merging to be done in
> a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes previously
> wasted as much as 8% of the available memory budget; tens of thousands
> of tapes could be logically allocated for a sort that will only benefit
> from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If
I understand correctly, that makes the runs shorter than necessary,
leading to more runs.
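
To put rough numbers on that (my own back-of-the-envelope, assuming
TAPE_BUFFER_OVERHEAD = 3 * BLCKSZ and MERGE_BUFFER_SIZE = 32 * BLCKSZ,
which is what I believe tuplesort_merge_order() is based on):

    maxTapes        ~ work_mem / (35 * BLCKSZ)
    up-front charge = maxTapes * 3 * BLCKSZ  ~ 3/35 ~ 8.6% of work_mem

So the initial runs are built with only ~91% of work_mem available for
tuples, even though only a few dozen of those tapes will ever hold a
run.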

> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
> tuplesort will ever use (i.e. merge order is capped at 500 inclusive).
> This is a conservative estimate of the number of runs at which doing all
> merging on-the-fly no longer allows greater overlapping of I/O and
> computation.

Hmm. Surely there are cases where, with more than 501 tapes, you could
do it in a single merge pass, but now you need two? And that would hurt
performance, no?
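
For illustration (numbers invented): sorting ~1TB with work_mem = 1GB
might produce on the order of 1000 initial runs. The current logic
would allow a few thousand tapes at that work_mem setting, so all 1000
runs get merged in a single final on-the-fly merge; with the cap at
501, you'd first have to merge some of those runs down to <= 500, i.e.
an extra merge step that re-reads and re-writes at least part of the
data.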

Why do we reserve the buffer space for all the tapes right at the
beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD)
call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every
time we start a new run, until we reach maxTapes?
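
Something like this, I mean (completely untested sketch; the
"tape_charged" flag is made up, and the real accounting would need to
handle LACKMEM etc.):

    /*
     * inittapes(): drop the up-front
     *   USEMEM(state, maxTapes * TAPE_BUFFER_OVERHEAD);
     * and instead, wherever a new run is assigned to a tape
     * (selectnewtape()), charge for that tape's buffer the first
     * time it's actually used:
     */
    if (!state->tape_charged[state->destTape])
    {
        USEMEM(state, TAPE_BUFFER_OVERHEAD);
        state->tape_charged[state->destTape] = true;
    }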

- Heikki
