Re: [HACKERS] [PATCH] Incremental sort

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Antonin Houska <ah(at)cybertec(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] [PATCH] Incremental sort
Date: 2018-03-07 23:49:39
Message-ID: dbf8e5f5-3e4d-c742-b508-45e41328b162@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox
Thread:
Lists: pgsql-hackers

On 03/05/2018 11:07 PM, Alexander Korotkov wrote:
> Hi!
>
> Thank you for reviewing this patch!
> Revised version is attached.
>

OK, the revised patch works fine - I've done a lot of testing and
benchmarking, and not a single segfault or any other crash.

Regarding the benchmarks, I generally used queries of the form

SELECT * FROM (SELECT * FROM t ORDER BY a) foo ORDER BY a,b

with the first sort done in various ways:

* regular Sort node
* indexes with Index Scan
* indexes with Index Only Scan

and all these three options with and without LIMIT (the limit was set to
1% of the source table).

I've also varied parallelism (max_parallel_workers_per_gather was set to
either 0 or 2), work_mem (from 4MB to 256MB) and data set size (tables
from 1000 rows to 10M rows).

All of this may seem like an overkill, but I've found a couple of
regressions thanks to that.

The full scripts and results are available here:

https://github.com/tvondra/incremental-sort-tests

The queries actually executed are a bit more complicated, to eliminate
overhead due to data transfer to client etc. The same approach was used
in the other sorting benchmarks we've done in the past.

I'm attaching results for two scales - 10k and 10M rows, preprocessed
into .ods format. I haven't looked at the other scales yet, but I don't
expect any surprises there.

Each .ods file contains raw data for one of the tests (matching the .sh
script filename), pivot table, and comparison of durations with and
without the incremental sort.

In general, I think the results look pretty impressive. Almost all the
comparisons are green, which means "faster than master" - usually by
tens of percent (without limit), or by up to ~95% (with LIMIT).

There are a couple of regressions in two cases sort-indexes and
sort-indexes-ios.

Oh the small dataset this seems to be related to the number of groups
(essentially, number of distinct values in a column). My assumption is
that there is some additional overhead when "switching" between the
groups, and with many groups it's significant enough to affect results
on these tiny tables (where master only takes ~3ms to do the sort). The
slowdown seems to be

On the large data set it seems to be somehow related to both work_mem
and number of groups, but I didn't have time to investigate that yet
(there are explain analyze plans in the results, so feel free to look).

In general, I think this looks really nice. It's certainly awesome with
the LIMIT case, as it allows us to leverage indexes on a subset of the
ORDER BY columns.

Now, there's a caveat in those tests - the data set is synthetic and
perfectly random, i.e. all groups equally likely, no correlations or
anything like that.

I wonder what is the "worst case" scenario, i.e. how to construct a data
set with particularly bad behavior of the incremental sort.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
sort-10000.tgz application/x-compressed-tar 332.7 KB
sort-10000000.tgz application/x-compressed-tar 406.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tsunakawa, Takayuki 2018-03-08 00:28:04 RE: Protect syscache from bloating with negative cache entries
Previous Message Michael Paquier 2018-03-07 23:43:30 Re: Server won't start with fallback setting by initdb.