Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-08-03 21:13:08
Message-ID: CAM3SWZQ=kYDF1s0TQsyZvmwUzV45eK7V21S5igjBSdkdAMNN1g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 3, 2016 at 11:42 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I'm not going to say it's bad to be able to do things 2-2.5x faster,
> but linear scalability this ain't - particularly because your 2.58x
> faster case is using up to 7 or 8 times as much memory. The
> single-process case would be faster in that case, too: you could
> quicksort.

Certainly, there are cases where a parallel operation benefits more
from the extra memory than from actually parallelizing the underlying
task. However, this case was pointedly chosen *not* to be such a case.
Once maintenance_work_mem exceeds about 5GB, I've observed that (since
9.6) increasing it further is about as likely to hurt as to help, by
roughly +/-5% (unless and until the sort fits entirely in memory,
which still doesn't help much). In general, there isn't all that much
point in doing a very large sort like this in memory: you just don't
get that much benefit for the memory you use, because linearithmic CPU
costs eventually dominate linear sequential I/O costs.
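
To put a rough number on that last point (an illustration of the
asymptotics, not anything from the patch): the comparison count for an
n-tuple sort grows as n * log2(n), so CPU work per tuple grows with
log2(n), while each pass of an external merge reads and writes every
tuple a constant number of times:

    #include <math.h>
    #include <stdio.h>

    /*
     * Back-of-the-envelope: comparisons per tuple for an n-tuple sort
     * grow logarithmically, while I/O per tuple per pass is constant.
     */
    int
    main(void)
    {
        double      n;

        for (n = 1e6; n <= 1e9; n *= 10)
            printf("n = %.0e: ~%.1f comparisons per tuple\n", n, log2(n));
        return 0;       /* ~19.9 at 1e6 tuples, ~29.9 at 1e9 */
    }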

I think you're focusing on the large absolute disparity in memory use
in this one benchmark, but the gains shown don't particularly hinge on
that. For example, there isn't that much difference when workers must
merge their own runs first: that saves the serial leader merge some
work, and in particular makes it more cache efficient (by leaving
fewer runs/tapes to merge).

Finally, while about 8x as much memory is used, the memory used over
and above the serial case is almost all freed when the final merge
begins (the final merges are therefore very similar in both cases,
including in terms of memory use). So, for as long as you use 8x as
much memory for 8 active processes, you get a 5.67x speed-up of that
part alone. You still keep a few extra KiBs of memory for worker tapes
and things like that during the leader's merge, but that amount is
close to negligible.

> I feel like for sorting, in particular, we probably ought
> to be setting the total memory budget, not the per-process memory
> budget. Or if not, then any CREATE INDEX benchmarking had better
> compare using scaled values for maintenance_work_mem; otherwise,
> you're measuring the impact of using more memory as much as anything
> else.

As I said, the benchmark was chosen to avoid that (and to be simple
and reproducible). I am currently neutral on the question of whether
maintenance_work_mem should be doled out per process or per sort
operation. I do think that making it a per-process allowance is far
closer to what we do for hash joins today, and is simpler.

What's nice about the idea of making the workMem/maintenance_work_mem
budget per sort is that it leaves the leader process with license to
greatly increase the amount of memory it can use for the merge. The
merge keeps benefiting from additional memory well past the point
where the workers' sorts stop benefiting. I've simulated it already;
see the sketch of the two policies below.
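
To be concrete, here is a minimal sketch of the two policies
(hypothetical C with invented names, not the patch's actual code;
nparticipants counts the leader plus the workers):

    #include <stddef.h>

    /*
     * Per-process allowance, much as hash join effectively gets
     * today: simple, but total memory use scales with worker count.
     */
    static size_t
    sort_mem_per_process(size_t maintenance_work_mem)
    {
        return maintenance_work_mem;
    }

    /*
     * Per-sort-operation budget: split among participants during run
     * generation, with the entire budget reverting to the leader for
     * the final merge.
     */
    static size_t
    sort_mem_per_operation(size_t maintenance_work_mem, int nparticipants)
    {
        return maintenance_work_mem / nparticipants;
    }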

> I also think that Amdahl's law is going to pinch pretty severely here.

Doesn't that almost always happen, though? Isn't that what you
generally see with queries that show off the parallel join capability?
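
Just to put rough numbers on the "pinch" (an illustration, not
anything from the patch): per Amdahl's law, if a fraction s of the
serial runtime cannot be parallelized (say, the leader's final merge),
then the speedup with n workers is 1 / (s + (1 - s) / n), with a hard
ceiling of 1/s:

    #include <stdio.h>

    /* Amdahl's law: speedup given serial fraction s and n workers. */
    static double
    amdahl(double s, int n)
    {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int
    main(void)
    {
        /*
         * A merge that is 25% of serial runtime caps the speedup at
         * 4x; 8 workers only get you ~2.91x.
         */
        printf("%.2fx\n", amdahl(0.25, 8));
        return 0;
    }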

> If the final merge phase is a significant percentage of the total
> runtime, picking an algorithm that can't parallelize the final merge
> is going to limit the speedups to small multiples. That's an OK place
> to be as a result of not having done all the work yet, but you don't
> want to get locked into it. If we're going to have a substantial
> portion of the work that can never be parallelized, maybe we've picked
> the wrong algorithm.

I suggest that this work be compared to something with similar
constraints. I used Google to try to get some indication of how much
of a difference parallel CREATE INDEX makes in other major database
systems. This is all I could find:

https://www.mssqltips.com/sqlservertip/3100/reduce-time-for-sql-server-index-rebuilds-and-update-statistics/

It seems like the degree of parallelism used for SQL Server tends to
affect index build time in a way that is strikingly similar to what
I've come up with (which may be a coincidence; I don't know anything
about SQL Server). So, I suspect that the performance of this is
fairly good in an apples-to-apples comparison.

Parallelizing merging can hurt or help, because there is a cost in
memory bandwidth (if not I/O) for the extra passes used to keep more
CPUs busy, which is somewhat analogous to the situation with polyphase
merge. That isn't to say we shouldn't do it anyway, but I believe the
returns diminish sharply. Tuple comparisons during a merge are much
more expensive than tuple comparisons during a quicksort, which tends
to benefit a great deal from abbreviated keys.
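
For anyone unfamiliar with why the quicksort comparisons are so much
cheaper, here is a generic sketch of abbreviated-key comparison (not
tuplesort.c's actual comparator): most comparisons are resolved by a
cheap compare of a fixed-size prefix, and only ties fall back to the
expensive authoritative comparison:

    #include <stdint.h>
    #include <string.h>

    typedef struct SortItem
    {
        uint64_t    abbrev;     /* e.g. first 8 bytes of a text key */
        const char *full_key;   /* the authoritative value */
    } SortItem;

    static int
    item_compare(const SortItem *a, const SortItem *b)
    {
        if (a->abbrev < b->abbrev)
            return -1;
        if (a->abbrev > b->abbrev)
            return 1;
        /* Tie on the cheap prefix: do the expensive comparison. */
        return strcmp(a->full_key, b->full_key);
    }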

As I've said, there is probably a good argument to be made for
partitioning to increase parallelism. But, that involves risks around
the partitioning being driven by statistics or a cost model, and I
don't think you'd be too keen on the idea of every CREATE INDEX after
a bulk load needing an ANALYZE first. I tend to think of that
as more of a parallel query thing, because you can often push down a
lot more there, dynamic sampling might be possible, and there isn't a
need to push all the tuples through one point in the end. Nothing I've
done here precludes your idea of a sort-order-preserving gather node.
I think that we may well need both.

Since merging is a big bottleneck with this, we should probably also
work to address that indirectly.

> The work on making the logtape infrastructure parallel-aware seems
> very interesting and potentially useful for other things. Sadly, I
> don't have time to look at it right now.

I would be happy to look at generalizing that further, to help
parallel hash join. As you know, Thomas Munro and I have discussed
this privately.

--
Peter Geoghegan
