Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-10-19 18:33:32
Message-ID: CAM3SWZT54zyKE+JSch9RN6u_0BrBoTKk3Vr8FKpvFM+GyLK2jA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Oct 19, 2016 at 7:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Gather Merge can't emit a tuple unless it has buffered at least one
> tuple from every producer; otherwise, the next tuple it receives from
> one of those producers might proceed whichever tuple it chooses to
> emit. However, it doesn't need to wait until all of the workers are
> completely done. The leader only needs to be at least slightly ahead
> of the slowest worker. I'm not sure how that compares to Peter's
> approach.

I don't think that eager merging will prove all that effective,
however it's implemented. I see a very I/O bound system when parallel
CREATE INDEX merges serially. There is no obvious reason why you'd
have a straggler worker process with CREATE INDEX, really.

> What I'm worried about is that we're implementing two separate systems
> to do the same thing, and that the parallel sort approach is actually
> a lot less general. I think it's possible to imagine a Parallel Sort
> implementation which does things Gather Merge can't. If all of the
> workers collaborate to sort all of the data rather than each worker
> sorting its own data, then you've got something which Gather Merge
> can't match. But this is not that.

It's not that yet, certainly. I think I've sketched a path forward for
making partitioning a part of logtape.c that is promising. The sharing
of ranges within tapes and so on will probably have a significant
amount in common with what I've come up with.

I don't think that any executor infrastructure is a particularly good
model when *batch output* is needed -- the tuple queue mechanism will
be a significant bottleneck, particularly because it does not
integrate read-ahead, etc. The best case that I saw advertised for
Gather Merge was TPC-H query 9 [1]. That doesn't look like a good
proxy for how Gather Merge adapted to parallel CREATE INDEX would do,
since it benefits from the GroupAggregate merge having many equal
values, possibly with a clustering in the original tables that can
naturally be exploited (no TID tiebreaker needed, since IndexTuples
are not being merged). Also, it looks like Gather Merge may do that
well by enabling things, rather than parallelizing the sort
effectively per se. Besides, the query 9 case is significantly less
scalable than good cases for this parallel CREATE INDEX patch have
already been shown to be.

I think I've been pretty modest about what this parallel CREATE INDEX
patch gets us from the beginning. It is a generalization of
tuplesort.c to work in parallel; we need a lot more for that to make
things like GroupAggregate as scalable as possible, and I don't
pretend that this helps much with that. There are actually more
changes to nbtsort.c to coordinate all of this than there are to
tuplesort.c in the latest version, so I think that this simpler
approach for parallel CREATE INDEX and CLUSTER is worthwhile.

The bottom line is that it's inherently difficult for me to refute the
idea that Gather Merge could do just as well as what I have here,
because proving that involves adding a significant amount of new
infrastructure (e.g., to teach the executor about IndexTuples). I
think that the argument for this basic approach is sound (it appears
to offer comparable scalability to the parallel CREATE INDEX
implementations of other systems), but it's simply impractical for me
to offer much reassurance beyond that.

[1] https://github.com/tvondra/pg_tpch/blob/master/dss/templates/9.sql
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2016-10-19 18:52:08 Re: Gather Merge
Previous Message Thom Brown 2016-10-19 17:44:29 Re: Patch: Implement failover on libpq connect level.