Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Rushabh Lathia <rushabh(dot)lathia(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Subject: Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date: 2018-01-10 18:45:14
Message-ID: CA+TgmobOwH=h8ZQ7Kn72hnteJbReZ=e3CbKsRiStZXiZYakTjA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 9, 2018 at 10:36 PM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> This looks good to me.

The addition to README.parallel is basically wrong, because workers
have been allowed to write WAL ever since the ParallelContext
machinery was introduced. See the XactLastRecEnd handling in
parallel.c. Workers can, for example, do HOT cleanups during SELECT
scans, just as the leader can. The language here is obsolete anyway
in light of commit e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this
isn't the right way to update it. I'll propose a separate patch for
that.
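
From memory (so please check the actual source rather than trusting
this sketch), the existing hand-off looks roughly like this:

    /* worker side, shortly before exit (parallel.c, approximately): */
    fps->last_xlog_end = XactLastRecEnd;

    /* leader side, in WaitForParallelWorkersToFinish(), approximately: */
    if (fps->last_xlog_end > XactLastRecEnd)
        XactLastRecEnd = fps->last_xlog_end;

Any WAL a worker writes therefore advances the leader's idea of the
transaction's last WAL record, just as if the leader had written it
itself.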

The change to the ParallelContext signature in parallel.h makes an
already-overlength line even longer. A line break seems warranted
just after the first argument, plus pgindent afterward.
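
Something like this, roughly (the name of the new last argument here
is only my recollection of the patch, so substitute whatever it is
actually called):

    extern ParallelContext *CreateParallelContext(const char *library_name,
                                const char *function_name, int nworkers,
                                bool serializable_okay);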

I am not a fan of the leader-as-worker terminology. The leader is not
a worker, full stop. I think we should instead talk about whether the
leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates,
for example, plus many comment updates). Similarly, it seems
SortCoordinateData's nLaunched should be nParticipants, and BTLeader's
nworkertuplesorts should be nparticipanttuplesorts.

There is also the question of whether we want to respect
parallel_leader_participation in this context. The issues which might
motivate the desire for such behavior in the context of a query do not
exist when creating a btree index, so maybe we're just making this
complicated. On the other hand, if some other type of parallel index
build does end up doing a Gather-like operation then we might regret
deciding that parallel_leader_participation doesn't apply to index
builds, so maybe it's OK the way we have it. On the third hand, the
complexity of having the leader maybe-participate seems like it
extends to a fair number of places in the code, and getting rid of all
that complexity seems appealing.

One place where this actually causes a problem is in the message changes
to index_build(). The revised ereport() violates translatability
guidelines, which require that messages not be assembled from pieces.
See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES
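
For example (just a sketch to illustrate the guideline, with made-up
variable names, not a suggestion for the exact wording):

    /* problematic: the message is assembled from pieces */
    ereport(DEBUG1,
            (errmsg("building index \"%s\" %s",
                    indexname,
                    use_parallel ? "with parallel workers" : "serially")));

    /* translatable: each case gets its own complete message */
    if (use_parallel)
        ereport(DEBUG1,
                (errmsg("building index \"%s\" with %d parallel workers",
                        indexname, nworkers)));
    else
        ereport(DEBUG1,
                (errmsg("building index \"%s\" serially",
                        indexname)));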

A comment added to tuplesort.h says that work_mem should be at least
64KB, but does not give any reason. I think one should be given, at
least briefly, so that someone looking at these comments in the future
can, for example, figure out whether the comment is still correct
after future code changes. Or else, remove the comment.

+ * Parallel sort callers are required to coordinate multiple tuplesort states
+ * in a leader process, and one or more worker processes. The leader process

I think the comma should be removed. As written, it looks like we
are coordinating multiple tuplesort states in a leader process, and,
separately, we are coordinating one or more worker processes. But in
fact we are coordinating multiple tuplesort states which are in a
group of processes that includes the leader and one or more worker
processes.

Generally, I think the comments in tuplesort.h are excellent. I
really like the overview of how the new interfaces should be used,
although I find it slightly wonky that the leader needs two separate
Tuplesortstates if it wants to participate.
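
To make that concrete, my reading of the intended usage pattern is
roughly the following; the function and field names are approximations
of what's in the patch, not necessarily exact:

    /* Tuplesortstate #1: the leader's coordinating state */
    SortCoordinate coordinate = palloc0(sizeof(SortCoordinateData));

    coordinate->isWorker = false;
    coordinate->nLaunched = nlaunched;   /* nParticipants, per the above */
    coordinate->sharedsort = sharedsort;
    leaderstate = tuplesort_begin_index_btree(heap, index, unique,
                                              maintenance_work_mem,
                                              coordinate, false);

    /*
     * Tuplesortstate #2: if the leader also participates, it needs a
     * second, worker-style Tuplesortstate of its own (with a coordinate
     * whose isWorker is true), fed and finished just like any worker's.
     */
    if (leader_participates)
        leaderworkerstate = tuplesort_begin_index_btree(heap, index, unique,
                                                        worker_mem,
                                                        worker_coordinate,
                                                        false);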

I don't understand why this patch needs to tinker with the tests in
vacuum.sql. The comments say that "If we did not do this, errors
raised would concern running ANALYZE in parallel mode." However, why
should parallel CREATE INDEX have any impact on ANALYZE at all?
Also, as a practical matter, if I revert those changes, 'make check'
still passes with or without force_parallel_mode=on.

I really dislike the fact that this patch invents another thing for
force_parallel_mode to do. I invented force_parallel_mode mostly as a
way of testing that functions were correctly labeled for
parallel-safety, and I think it would be just fine if it never does
anything else. As it is, it already does two quite separate things to
accomplish that goal: (1) it forcibly runs the whole plan with
parallel mode restrictions enabled, provided that the plan is not
parallel-unsafe, and (2) it runs the plan in a worker, provided that
the plan is parallel-safe. There's a subtle difference between those
two conditions, which is that not parallel-unsafe does not equal
parallel-safe; there is also parallel-restricted. The fact that
force_parallel_mode controls two different behaviors has, I think,
already caused some confusion for prominent PostgreSQL developers and,
likely, users as well. Making it do a third thing seems to me to be
adding to the confusion, and not only because there are no
documentation changes to match. If we go down this road, there will
probably be more additions -- what happens when parallel VACUUM
arrives, or parallel CLUSTER, or whatever? I don't think it will be a
good thing for PostgreSQL if we end up with force_parallel_mode=on as
a general "use parallelism even though it's stupid" flag, requiring
supporting code in many different places throughout the code base and
a laundry list of not-actually-useful behavior changes in the
documentation.

What I think would be a lot more useful, and what I sort of expected
the patch to have, is a way for a user to explicitly control the
number of workers requested for a CREATE INDEX operation. We all know
that the cost model is crude and that may be OK -- though it would be
interesting to see some research on what the run times actually look
like for various numbers of workers at various table sizes and
work_mem settings -- but it will be inconvenient for DBAs who actually
know what number of workers they want to use to instead get whatever
value plan_create_index_workers() decides to emit. They can force it
by setting the parallel_workers reloption, but that affects queries.
They can probably also do it by setting min_parallel_table_scan_size =
0 and max_parallel_workers_maintenance to whatever value they want,
but I think it would be convenient for there to be a more
straightforward way to do it, or at least some documentation in the
CREATE INDEX page about how to get the number of workers you really
want. To be clear, I don't think that this is a must-fix issue for
this patch to get committed, but I do think that all reference to
force_parallel_mode=on should go away.

I do not like the way that this patch wants to turn the section of the
documentation on when parallel query can be used into a discussion of
when parallelism can be used. I think it would be better to leave
that section alone and instead document under CREATE INDEX the
concerns specific to parallel index build. I think this will be easier
for users to understand and far easier to maintain as the number of
parallel DDL operations increases, which I expect it to do somewhat
explosively. The patch as written says things like "If a utility
statement that is expected to do so does not produce a parallel plan,
..." but, one, utility statements *do not produce plans of any type*
and, two, the concerns here are really specific to parallel CREATE
INDEX and there is every reason to think that they might be different
in other cases. I feel strongly that it's enough for this section to
try to explain the concerns that pertain to optimizable queries and
leave utility commands to be treated elsewhere. If we find that we're
accumulating a lot of documentation for various parallel utility
commands that seems to be duplicative, we can write a general
treatment of that topic that is separate from this one.

The documentation for max_parallel_workers_maintenance cribs from the
documentation for max_parallel_workers_per_gather in saying that we'll
use fewer workers than expected "which may be inefficient". However,
for parallel CREATE INDEX, that trailing clause is, at least as far as
I can see, not applicable. For a query, we might choose a Gather over
a Parallel Seq Scan because we think we've got a lot of workers; with
only one participant, we might prefer a GIN index scan. If it turns
out we don't get the workers, we've got a clearly suboptimal plan.
For CREATE INDEX, though, it seems to me that we don't make any
decisions based on the number of workers we think we'll have. If we
get fewer workers, it may be slower, but it should still be as fast
as it can be with that number of workers, which for queries is not the
case.

+ * These fields are not modified throughout the sort. They primarily
+ * exist for the benefit of worker processes, that need to create BTSpool
+ * state corresponding to that used by the leader.

throughout -> during

remove comma

+ * builds, that must work just the same when an index is built in

remove comma

+ * State that is aggregated by workers, to report back to leader.

State that is maintained by workers and reported back to leader.

+ * indtuples is the total number of tuples that made it into index.

into the index

+ * btleader is only present when a parallel index build is performed, and
+ * only in leader process (actually, only the leader has a BTBuildState.
+ * Workers have their own spool and spool2, though.)

the leader process
period after "process"
capitalize actually

+ * Done. Leave a way for leader to determine we're finished. Record how
+ * many tuples were in this worker's share of the relation.

I don't understand what the "Leave a way" comment means.

+ * To support parallel sort operations involving coordinated callers to
+ * tuplesort.c routines across multiple workers, it is necessary to
+ * concatenate each worker BufFile/tapeset into one single leader-wise
+ * logical tapeset. Workers should have produced one final materialized
+ * tape (their entire output) when this happens in leader; there will always
+ * be the same number of runs as input tapes, and the same number of input
+ * tapes as workers.

I can't interpret the word "leader-wise". A partition-wise join is a
join done one partition at a time, but a leader-wise logical tape set
is not done one leader at a time. If there's another meaning to the
affix -wise, I'm not familiar with it. Don't we just mean "a single
logical tapeset managed by the leader"?

There's a lot here I haven't grokked yet, but I'm running out of
mental energy so I think I'll send this for now and work on this some
more when time permits, hopefully tomorrow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
