Re: Parallel tuplesort (for parallel B-Tree index creation)

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Corey Huinker <corey(dot)huinker(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date: 2016-11-08 04:28:35
Views: Raw Message | Whole Thread | Download mbox
Lists: pgsql-hackers

On Mon, Oct 24, 2016 at 6:17 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> * Cost model. Should probably attempt to guess final index size, and
>> derive calculation of number of workers from that. Also, I'm concerned
>> that I haven't given enough thought to the low end, where with default
>> settings most CREATE INDEX statements will use at least one parallel
>> worker.

> While I haven't made progress on any of these open items, I should
> still get a version out that applies cleanly on top of git tip --
> commit b75f467b6eec0678452fd8d7f8d306e6df3a1076 caused the patch to
> bitrot. I attach V4, which is a fairly mechanical rebase of V3, with
> no notable behavioral changes or bug fixes.

I attach V5. Changes:

* A big cost model overhaul. Workers are logarithmically scaled based
on projected final *index* size, not current heap size, as was the
case in V4. A new nbtpage.c routine is added to estimate a
not-yet-built B-Tree index's size, now called by the optimizer. This
involves getting average item width for indexed attributes from
pg_attribute for the heap relation. There are some subtleties here
with partial indexes, null_frac, etc. I also refined the cap applied
on the number of workers that limits too many workers being launched
when there isn't so much maintenance_work_mem.

The cost model is much improved now -- it is now more than just a
placeholder, at least. It doesn't do things like launch a totally
inappropriate number of workers to build a very small partial index.
Granted, those workers would still have something to do -- scan the
heap -- but not enough to justify launching so many (that is,
launching as many as would be launched for an equivalent non-partial

That having been said, things are still quite fudged here, and I
struggle to find any guiding principle around doing better on average.
I think that that's because of the inherent difficulty of modeling
what's going on, but I'd be happy to be proven wrong on that. In any
case, I think it's going to be fairly common for DBAs to want to use
the storage parameter to force the use of a particular number of
parallel workers.

(See also: my remarks below on how the new bt_estimate_nblocks()
SQL-callable function can give insight into the new cost model's

* Overhauled leader_mergeruns() further, to make it closer to
mergeruns(). We now always rewind input tapes. This simplification
involved refining some of the assertions within logtape.c, which is
also slightly simplified.

* 2 new testing tools are added to the final commit in the patch
series (not actually proposed for commit). I've added 2 new
SQL-callable functions to contrib/pageinspect.

The 2 new testing functions are:


bt_estimate_nblocks() provides an easy way to see the optimizer's
projection of how large the final index will be. It returns an
estimate in blocks. Example:

mgd=# analyze;
mgd=# select oid::regclass as rel,
to_char(bt_estimated_nblocks(oid)::numeric / relpages, 'FM990.990') as
from pg_class
where relkind = 'i'
order by relpages desc limit 20;

rel │
bt_estimated_nblocks │ relpages │ estimate_actual
mgd.acc_accession_idx_accid │
107,091 │ 106,274 │ 1.008
mgd.acc_accession_0 │
169,024 │ 106,274 │ 1.590
mgd.acc_accession_1 │
169,024 │ 80,382 │ 2.103
mgd.acc_accession_idx_prefixpart │
76,661 │ 80,382 │ 0.954
mgd.acc_accession_idx_mgitype_key │
76,661 │ 76,928 │ 0.997
mgd.acc_accession_idx_clustered │
76,661 │ 76,928 │ 0.997
mgd.acc_accession_idx_createdby_key │
76,661 │ 76,928 │ 0.997
mgd.acc_accession_idx_numericpart │
76,661 │ 76,928 │ 0.997
mgd.acc_accession_idx_logicaldb_key │
76,661 │ 76,928 │ 0.997
mgd.acc_accession_idx_modifiedby_key │
76,661 │ 76,928 │ 0.997
mgd.acc_accession_pkey │
76,661 │ 76,928 │ 0.997
mgd.mgi_relationship_property_idx_propertyname_key │
74,197 │ 74,462 │ 0.996
mgd.mgi_relationship_property_idx_modifiedby_key │
74,197 │ 74,462 │ 0.996
mgd.mgi_relationship_property_pkey │
74,197 │ 74,462 │ 0.996
mgd.mgi_relationship_property_idx_clustered │
74,197 │ 74,462 │ 0.996
mgd.mgi_relationship_property_idx_createdby_key │
74,197 │ 74,462 │ 0.996
mgd.seq_sequence_idx_clustered │
50,051 │ 50,486 │ 0.991
mgd.seq_sequence_raw_pkey │
35,826 │ 35,952 │ 0.996
mgd.seq_sequence_raw_idx_modifiedby_key │
35,826 │ 35,952 │ 0.996
mgd.seq_source_assoc_idx_clustered │
35,822 │ 35,952 │ 0.996
(20 rows)

I haven't tried to make the underlying logic as close to perfect as
possible, but it tends to be accurate in practice, as is evident from
this real-world example (this shows larger indexes following a
restoration of the mouse genome sample database [1]). Perhaps there
could be a role for a refined bt_estimate_nblocks() function in
determining when B-Tree indexes become bloated/unbalanced (maybe have
pgstatindex() estimate index bloat based on a difference between
projected and actual fan-in?). That has nothing to do with parallel


bt_main_forks_identical() checks if 2 specific relations have bitwise
identical main forks. If they do, it returns the number of blocks in
the main fork of each. Otherwise, an error is raised.

Unlike any approach involving *writing* the index in parallel (e.g.,
any worthwhile approach based on data partitioning), the proposed
parallel CREATE INDEX implementation creates an identical index
representation to that created by any serial process (including, for
example, the master branch when CREATE INDEX uses an internal sort).
The index that you end up with when parallelism is used ought to be
100% identical in all cases.

(This is true because there is a TID tie-breaker when sorting B-Tree
index tuples, and because LSNs are set to 0 by CREATE INDEX. Why not
exploit that fact to test the implementation?)

If anyone can demonstrate that parallel CREATE INDEX fails to create a
non-bitwise-identical index representation to a "known good"
implementation, or can demonstrate that it doesn't consistently
produce exactly the same final index representation given the same
underlying table as input, then there *must* be a bug.
bt_main_forks_identical() gives reviewers an easy way to verify this,
perhaps just in passing during benchmarking.


It occurs to me that parallel CREATE INDEX receives no special
consideration by pg_restore. This leaves it so that the use of
parallel CREATE INDEX can come down to whether or not
pg_class.reltuples is accidentally updated by something like an
initial CREATE INDEX. This is not ideal. There is also the questions
of how pg_restore -j cases ought to give special consideration to
parallel CREATE INDEX, if at all -- it's probably true that concurrent
index builds on the same relation do go together well with parallel
CREATE INDEX, but even in V5 pg_restore remains totally naive about

That having been said, pg_restore currently does nothing clever with
maintenance_work_mem when multiple jobs are used, even though that
seems at least as useful as what I outline for parallel CREATE INDEX.
It's not clear how to judge this.

What do we need to teach pg_restore about parallel CREATE INDEX, if
anything at all? Could this be as simple as a blanket disabling of
parallelism for CREATE INDEX from pg_restore? Or, does it need to be
more sophisticated than that? I suppose that tools like reindexdb and
pgbench must be considered in a similar way.

Maybe we could get the number of blocks in the heap relation when its
pg_class.reltupes is 0, from the smgr, and then extrapolate the
reltuples using simple, generic logic, in the style of
vac_estimate_reltuples() (its "old_rel_pages" == 0 case). For now,
I've avoided doing that out of concern for the overhead in cases where
there are many small tables to be restored, and because it may be
better to err on the side of not using parallelism.

Peter Geoghegan

Attachment Content-Type Size
0001-Cap-the-number-of-tapes-used-by-external-sorts.patch.gz application/x-gzip 1.8 KB
0003-Add-temporary-testing-tools.patch.gz application/x-gzip 4.7 KB
0002-Add-parallel-B-tree-index-build-sorting.patch.gz application/x-gzip 57.0 KB

In response to


Browse pgsql-hackers by date

  From Date Subject
Next Message Tsunakawa, Takayuki 2016-11-08 04:36:21 Re: Re: BUG #13755: pgwin32_is_service not checking if SECURITY_SERVICE_SID is disabled
Previous Message Amit Langote 2016-11-08 04:08:45 Re: Declarative partitioning - another take