Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-08-05 16:06:19
Message-ID: CA+TgmoY5JYs4R1g_ZJ-P6SkULSb19xx4zUh7S8LJiXonCgVTuQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2016 at 5:13 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I'm not going to say it's bad to be able to do things 2-2.5x faster,
>> but linear scalability this ain't - particularly because your 2.58x
>> faster case is using up to 7 or 8 times as much memory. The
>> single-process case would be faster in that case, too: you could
>> quicksort.
>
> [ lengthy counter-argument ]

None of this convinces me that testing this in a way that is not
"apples to apples" is a good idea, nor will any other argument.

>> I also think that Amdahl's law is going to pinch pretty severely here.
>
> Doesn't that almost always happen, though?

To some extent, sure, absolutely. But it's our job as developers to
try to foresee and minimize those cases. When Noah was at
EnterpriseDB a few years ago and we were talking about parallel
internal sort, Noah started by doing a survey of the literature and
identified parallel quicksort as the algorithm that seemed best for
our use case. Of course, every time quicksort partitions the input,
you get two smaller sorting problems, so it's easy to see how to use 2
CPUs after the initial partitioning step has been completed and 4 CPUs
after each of those partitions has been partitioned again, and so on.
However, that turns out not to be good enough because the first
partitioning step can consume a significant percentage of the total
runtime - so if you only start parallelizing after that, you're
leaving too much on the table. To avoid that, the algorithm he was
looking at had a (complicated) way of parallelizing the first
partitioning step; then you can, it seems, do the full sort in
parallel.
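
To make the serial-bottleneck part of that concrete, here's a toy
sketch (mine, nothing to do with Noah's survey or any real patch) of
the naive version, where workers only come into play *after* each
serial partition step:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Toy parallel quicksort, just to illustrate the Amdahl's law
     * problem: every partition step is serial, so the first call scans
     * all n elements on one CPU before a second worker can do anything.
     */
    typedef struct
    {
        int    *a;
        int     lo;
        int     hi;
        int     depth;
    } SortTask;

    static void psort(int *a, int lo, int hi, int depth);

    static int
    partition(int *a, int lo, int hi)
    {
        int     pivot = a[hi];
        int     i = lo;

        for (int j = lo; j < hi; j++)
        {
            if (a[j] < pivot)
            {
                int     t = a[i];

                a[i] = a[j];
                a[j] = t;
                i++;
            }
        }
        a[hi] = a[i];
        a[i] = pivot;
        return i;
    }

    static void *
    sort_worker(void *arg)
    {
        SortTask   *t = (SortTask *) arg;

        psort(t->a, t->lo, t->hi, t->depth);
        return NULL;
    }

    static void
    psort(int *a, int lo, int hi, int depth)
    {
        int     mid;

        if (lo >= hi)
            return;

        mid = partition(a, lo, hi);    /* serial: O(hi - lo) each time */

        if (depth > 0)
        {
            /* Hand the left half to a new worker; keep the right. */
            pthread_t   th;
            SortTask    task = {a, lo, mid - 1, depth - 1};

            pthread_create(&th, NULL, sort_worker, &task);
            psort(a, mid + 1, hi, depth - 1);
            pthread_join(th, NULL);
        }
        else
        {
            psort(a, lo, mid - 1, 0);
            psort(a, mid + 1, hi, 0);
        }
    }

    int
    main(void)
    {
        int     a[] = {5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
        int     n = sizeof(a) / sizeof(a[0]);

        psort(a, 0, n - 1, 2);    /* depth 2 => at most 4 workers */
        for (int i = 0; i < n; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }

Even with unbounded depth, that first partition() call still walks all
n elements on a single CPU, which is exactly the step the algorithm
Noah was looking at goes to the trouble of parallelizing.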

There are some somewhat outdated and perhaps naive ideas about this
that we wrote up here:

https://wiki.postgresql.org/wiki/Parallel_Sort

Anyway, you're proposing an algorithm that can't be fully
parallelized. Maybe that's OK. But I'm a little worried about it.
I'd feel more confident if we knew that the merge could be done in
parallel and were just leaving that to a later development stage; or
if we picked an algorithm like the one above that doesn't leave a
major chunk of the work unparallelizable.
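
To put rough numbers on "small multiples" (purely illustrative, not
measurements): if a fraction p of the total work parallelizes
perfectly over n workers, Amdahl's law gives

    S(n) = 1 / ((1 - p) + p / n)

So if the serial merge eats, say, 25% of the runtime (p = 0.75), four
workers get you S(4) = 1 / (0.25 + 0.75/4) ~= 2.3, and even infinitely
many workers cap out at S = 4. That's in the same ballpark as the
2.58x figure upthread, though the real profile is obviously more
complicated than a single p.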

> Isn't that what you
> generally see with queries that show off the parallel join capability?

For nested loop joins, no. The whole join operation can be done in
parallel. For hash joins, yes: building the hash table once per
worker can run afoul of Amdahl's law in a big way. That's why Thomas
Munro is working on fixing it:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap
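
Back-of-envelope (illustrative only, ignoring coordination costs): if
building the inner hash table costs B and probing costs P, then with
every one of n workers building its own copy the wall-clock time is
roughly

    T_dup(n) = B + P / n

whereas a shared build would look more like

    T_shared(n) = (B + P) / n

so the duplicated build puts a hard floor of B under the runtime no
matter how many workers you throw at it.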

Obviously, parallel query is subject to a long list of annoying
restrictions at this point. On queries that don't hit any of those
restrictions we can get 4-5x speedup with a leader and 4 workers. As
we expand the range of plan types that we can construct, I think we'll
see those kinds of speedups for a broader range of queries. (Exactly
why we currently top out at so few workers also needs more
investigation; maybe contention effects?)

>> If the final merge phase is a significant percentage of the total
>> runtime, picking an algorithm that can't parallelize the final merge
>> is going to limit the speedups to small multiples. That's an OK place
>> to be as a result of not having done all the work yet, but you don't
>> want to get locked into it. If we're going to have a substantial
>> portion of the work that can never be parallelized, maybe we've picked
>> the wrong algorithm.
>
> I suggest that this work be compared to something with similar
> constraints. I used Google to try to get some indication of how much
> of a difference parallel CREATE INDEX makes in other major database
> systems. This is all I could find:
>
> https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

I do agree that it is important not to have unrealistic expectations.

> As I've said, there is probably a good argument to be made for
> partitioning to increase parallelism. But, that involves risks around
> the partitioning being driven by statistics or a cost model, and I
> don't think you'd be too on board with the idea of every CREATE INDEX
> after bulk loading needing an ANALYZE first. I tend to think of that
> as more of a parallel query thing, because you can often push down a
> lot more there, dynamic sampling might be possible, and there isn't a
> need to push all the tuples through one point in the end. Nothing I've
> done here precludes your idea of a sort-order-preserving gather node.
> I think that we may well need both.

Yes. Rushabh is working on that, and Finalize GroupAggregate ->
Gather Merge -> Partial GroupAggregate -> Sort -> whatever is looking
pretty sweet.
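
Spelled out as a plan tree (hypothetical EXPLAIN shape, scan node just
for illustration):

    Finalize GroupAggregate
      ->  Gather Merge
            ->  Partial GroupAggregate
                  ->  Sort
                        ->  whatever (e.g. Parallel Seq Scan)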

>> The work on making the logtape infrastructure parallel-aware seems
>> very interesting and potentially useful for other things. Sadly, I
>> don't have time to look at it right now.
>
> I would be happy to look at generalizing that further, to help
> parallel hash join. As you know, Thomas Munro and I have discussed
> this privately.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
