Re: Using quicksort for every external sort run

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, Greg S <stark(at)mit(dot)edu>
Subject: Re: Using quicksort for every external sort run
Date: 2016-01-29 16:53:46
Message-ID: CAM3SWZSdRUTZeLLfQUa0ThsWjbHU5t4y-Ef=PEU9KE_PfF9yJg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 29, 2016 at 3:41 AM, Mithun Cy <mithun(dot)cy(at)enterprisedb(dot)com> wrote:
> I just ran some tests on the above patch, mainly to compare how "longer sort
> keys" behave with the new algorithm (Qsort) and the old one (RS).
> I have 8GB of RAM and SSD storage.
>
> Settings and Results.
> ----------------------------
> work_mem = DEFAULT (4MB).
> key width = 520.

> If the data is already in the same order as the sort key, then the current code performs
> better than the proposed patch as the sort size increases.
>
> The new algorithm does not appear to have any major impact if rows are presorted in the opposite order.
>
> For randomly distributed data, quicksort performs well compared to the current sort method (RS).
>
>
> ======================================================
> Now increase work_mem to 64MB, sorting 14 GB of data.
>
> CASE 1: We can see Qsort is able to catch up with the current sort method (RS).
> CASE 2: No impact.
> CASE 3: RS is able to catch up with Qsort.

I think that the basic method you're using to do these tests may have
additional overhead:

-- sort in ascending order.
CREATE FUNCTION test_orderby_asc() RETURNS int
AS $$
#print_strict_params on
DECLARE
    gs int;
    jk text;
BEGIN
    SELECT string_4k, generate_series INTO jk, gs
    FROM so
    ORDER BY string_4k, generate_series;

    RETURN gs;
END
$$ LANGUAGE plpgsql;
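
A plain query avoids that plpgsql overhead. As a rough sketch (reusing the
table and column names from the function above), something like this times
just the sort, since EXPLAIN ANALYZE executes the plan without shipping any
rows back to the client:

EXPLAIN ANALYZE
SELECT string_4k, generate_series
FROM so
ORDER BY string_4k, generate_series;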

Anyway, these test cases all remove much of the advantage of increased
cache efficiency. No comparisons are *ever* resolved using the
leading attribute, which calls into question why anyone would sort on
that attribute at all. It's 512 bytes, so it artificially makes the
comparisons themselves the bottleneck, as opposed to cache efficiency.
You can't even fit the second attribute in the same cache line as the
first in the "tuple proper" (MinimalTuple).

You are using a 4MB work_mem setting, but you almost certainly have a
CPU with an L3 cache size that's a multiple of that, even on cheap
consumer-grade hardware. You have 8GB of RAM; a 4MB work_mem setting
is very small in an absolute sense, and especially small relative to
the size of the data.
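
For reference, bumping work_mem for the later 64MB tests is just a
session-level SET:

SET work_mem = '64MB';
SHOW work_mem;  -- confirm the value in effect for this session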

You mentioned "CASE 3: RS is able to catch up with Qsort", which
doesn't make much sense to me. The only way I can see that being
possible is if the increased work_mem is enough to produce much longer
runs, because there is in fact some correlation in the data, and the
increased work_mem makes the critical difference, perhaps allowing a
single long run to be used -- there is now enough memory to "juggle"
tuples without ever needing to start a new run. But how could that be?
You said case 3 was totally random data, so I'd only expect an
incremental improvement. It could also be some weird effect from
polyphase merge -- a discontinuity.

I also don't understand why the patch ("Qsort") can be so much slower
in case 1 than in case 3 at the 3.5GB+ sizes, but not at the 1.7GB
size. Even leaving aside the differences between "RS" and "Qsort", it
makes no sense to me that *both* are faster with random data ("CASE 3")
than with presorted data ("CASE 1").

Another weird thing is that the traditional best case for replacement
selection ("RS") is a strong correlation, and a traditional worst case
is an inverse correlation, where run size is bound strictly by memory.
But you show just the opposite here -- the inverse correlation is
faster with RS in the 1.7 GB data case. So, I have no idea what's
going on here, and find it all very confusing.
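
It would help to generate the three orderings explicitly, so each case is
reproducible. An illustrative sketch only -- the table names and sizes here
are mine, not from your tests:

CREATE TABLE so_asc AS   -- presorted: RS's traditional best case
  SELECT i FROM generate_series(1, 10000000) AS i;
CREATE TABLE so_desc AS  -- inverse correlation: its traditional worst case
  SELECT i FROM generate_series(10000000, 1, -1) AS i;
CREATE TABLE so_rand AS  -- random
  SELECT (random() * 10000000)::int AS i
  FROM generate_series(1, 10000000);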

In order for these numbers to be useful, they need more detail --
"trace_sort" output. There are enough confounding factors in general,
and especially here, that not having that information makes raw
numbers very difficult to interpret.
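
Capturing that is just a couple of settings per session (trace_sort is a
developer option, but it's compiled in by default):

SET trace_sort = on;
SET client_min_messages = log;
-- then re-run each test query; the log shows the number of runs,
-- merge passes, and memory actually used for each sort.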

> I think for long keys, both the old (RS) and the new (Qsort) sort methods have their own characteristics
> based on data distribution. I think work_mem is the key: if it is set properly, the new method (Qsort) will
> be able to fit most of the cases. If work_mem is not tuned right, there are cases where it can regress.

work_mem is impossible to tune right with replacement selection.
That's a key advantage of the proposed new approach.

--
Peter Geoghegan
