Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-09-10 00:22:01
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Thu, Sep 8, 2016 at 2:18 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>>> setup:
>>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
>>> insert into lotsofitext select cast(random() * 1000000000.0 as text)
>>> || 'blablablawiiiiblabla', cast(random() * 1000000000.0 as text) ||
>>> 'blablablawjjjblabla', cast(random() * 1000000000.0 as text) ||
>>> 'blablabl
>>> awwwabla', random() * 1000000000.0, random() * 1000000000000.0 from
>>> generate_series(1, 10000000);
>>> timed:
>>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>>> Unpatched Time: 100351.251 ms
>>> Patched Time: 75180.787 ms
>>> That's like a 25% speedup on random input. As we say over here, rather
>>> badly translated, not a turkey's boogers (meaning "nice!")
>> Cool! What work_mem setting were you using here?
> The script iterates over a few variations of string patterns (easy
> comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB,
> 4GB), and table sizes (~350M, ~650M, ~1.5G).
> That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.

Well, the worst regression I see is under the noise for this test
(which seems rather high at 5%, but it's expectable since it's mostly
big queries).

Most samples show an improvement, either marginal or significant. The
most improvement is, naturally, on low work_mem settings. I don't see
significant slowdown on work_mem settings that should result in just a
few tapes being merged, but I didn't instrument to check how many
tapes were being merged in any case.

Attached are the results both in ods, csv and raw formats.

I think these are good results.

So, to summarize the review:

- Patch seems to follow the coding conventions of surrounding code
- Applies cleanly on top of25794e841e5b86a0f90fac7f7f851e5d950e51e2,
plus patches 1 and 2.
- Builds without warnings
- Passes regression tests
- IMO has sufficient coverage from existing tests (none added)
- Does not introduce any significant performance regression
- Best improvement of 67% (reduction of runtime to 59%)
- Average improvement of 30% (reduction of runtime to 77%)
- Worst regression of 5% (increase of runtime to 105%), which is under
the noise for control queries, so not significant
- Performance improvement is highly quite desirable in this merge
step, as it's a big bottleneck on parallel sort (and seems also
regular sort)
- All testing was done on random input, presorted input *will* show
more pronounced improvements

I suggested to change a few asserts in tuplesort_heap_root_displace to
make the debug code stricter in checking the assumptions, but they're
not blockers:

+ Assert(state->memtupcount > 1 || imin == 0);
+ memtuples[imin] = *newtup;


+ Assert(imin < state->memtupcount);
+ memtuples[imin] = *newtup;

And, perhaps as well,

+ Assert(memtuples[0].tupindex == newtup->tupindex);


+ Assert(state->memtupcount > 0 && memtuples[0].tupindex ==

It was suggested that both tuplesort_heap_siftup and
tuplesort_heap_root_displace could be wrappers around a common
"siftup" implementation, since the underlying operation is very

Since it is true that doing so would make it impossible to keep the
asserts about tupindex in tuplesort_heap_root_displace, I guess it
depends on how useful those asserts are (ie: how likely it is that
those conditions could be violated, and how damaging it could be if
they were). If it is decided the refactor is desirable, I'd suggest
making the common siftup producedure static inline, to allow
tuplesort_heap_root_displace to inline and specialize it, since it
will be called with checkIndex=False and that simplifies the resulting
code considerably.

Peter also mentioned that there were some other changes going on in
the surrounding code that could impact this patch, so I'm marking the
patch Waiting on Author.

Overall, however, I believe the patch is in good shape. Only minor
form issues need to be changed, the functionality seems both desirable
and ready.

Attachment Content-Type Size
root_displace_patched_timings.log text/plain 70.7 KB application/x-sh 634 bytes
root_displace_tests.sql application/sql 3.7 KB
root_displace_tests_setup.sql application/sql 933 bytes
root_displace_tests_slowsetup.sql application/sql 1.1 KB
root_displace_timings.csv text/csv 64.6 KB
root_displace_timings.ods application/vnd.oasis.opendocument.spreadsheet 63.4 KB
root_displace_unpatched_timings.log text/plain 70.7 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Claudio Freire 2016-09-10 00:29:13 Re: Parallel tuplesort (for parallel B-Tree index creation)
Previous Message Tom Lane 2016-09-09 23:04:45 Re: Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]