Re: speeding up planning with partitions

From: David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>
To: Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: speeding up planning with partitions
Date: 2018-09-10 23:23:00
Message-ID: CAKJS1f_xnACVym3QbBp5jshwK=SF9jseCUxh=q79LQQNSUPxUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 30 August 2018 at 21:29, Amit Langote <Langote_Amit_f8(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Attached updated patches, with 0002 containing the changes mentioned above.

Here's my first pass review on this:

0001:

1. I think the following paragraph should probably mention something
about some performance difference between the two methods:

<para>
Constraint exclusion works in a very similar way to partition
pruning, except that it uses each table's <literal>CHECK</literal>
constraints &mdash; which gives it its name &mdash; whereas partition
pruning uses the table's partition bounds, which exist only in the
case of declarative partitioning. Another difference is that
constraint exclusion is only applied at plan time; there is no attempt
to remove partitions at execution time.
</para>

Perhaps tagging on. "Constraint exclusion is also a much less
efficient way of eliminating unneeded partitions as metadata for each
partition must be loaded in the planner before constraint exclusion
can be applied. This is not a requirement for partition pruning."

2. I think "rootrel" should be named "targetpart" in:

+ RelOptInfo *rootrel = root->simple_rel_array[root->parse->resultRelation];

3. Why does the following need to list_copy()?

+ List *saved_join_info_list = list_copy(root->join_info_list);

4. Is the "root->parse->commandType != CMD_INSERT" required in:

@@ -181,13 +185,30 @@ make_one_rel(PlannerInfo *root, List *joinlist)

/*
* Generate access paths for the entire join tree.
+ *
+ * If we're doing this for an UPDATE or DELETE query whose target is a
+ * partitioned table, we must do the join planning against each of its
+ * leaf partitions instead.
*/
- rel = make_rel_from_joinlist(root, joinlist);
+ if (root->parse->resultRelation &&
+ root->parse->commandType != CMD_INSERT &&
+ root->simple_rel_array[root->parse->resultRelation] &&
+ root->simple_rel_array[root->parse->resultRelation]->part_scheme)
+ {

Won't the simple_rel_array entry for the resultRelation always be NULL
for an INSERT?

5. In regards to:

+ /*
+ * Hack to make the join planning code believe that 'partrel' can
+ * be joined against.
+ */
+ partrel->reloptkind = RELOPT_BASEREL;

Have you thought about other implications of join planning for other
member rels, for example, equivalence classes and em_is_child?

6. It would be good to not have to rt_fetch the same RangeTblEntry twice here:

@@ -959,7 +969,9 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
* needs special processing, else go straight to grouping_planner.
*/
if (parse->resultRelation &&
- rt_fetch(parse->resultRelation, parse->rtable)->inh)
+ rt_fetch(parse->resultRelation, parse->rtable)->inh &&
+ rt_fetch(parse->resultRelation, parse->rtable)->relkind !=
+ RELKIND_PARTITIONED_TABLE)
inheritance_planner(root);

7. Why don't you just pass the Parse into the function as a parameter
instead of overwriting PlannerInfo's copy in:

+ root->parse = partition_parse;
+ partitionwise_adjust_scanjoin_target(root, child_rel,
+ subroots,
+ partitioned_rels,
+ resultRelations,
+ subpaths,
+ WCOLists,
+ returningLists,
+ rowMarks);
+ /* Restore the Query for processing the next partition. */
+ root->parse = parse;

8. Can you update the following comment to mention why you're not
calling add_paths_to_append_rel for this case?

@@ -6964,7 +7164,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}

/* Build new paths for this relation by appending child paths. */
- if (live_children != NIL)
+ if (live_children != NIL &&
+ !(rel->reloptkind == RELOPT_BASEREL &&
+ rel->relid == root->parse->resultRelation))
add_paths_to_append_rel(root, rel, live_children);

9. The following should use >= 0, not > 0

+ while ((relid = bms_next_member(child_rel->relids, relid)) > 0)

0002:

10. I think moving the PlannerInfo->total_table_pages calculation
needs to be done in its own patch. This is a behavioural change where
we no longer count pruned relations in the calculation which can
result in plan changes. I posted a patch in [1] to fix this as a bug
fix as I think the current code is incorrect. My patch also updates
the first paragraph of the comment. You've not done that.

11. "pruning"

+ * And do prunin. Note that this adds AppendRelInfo's of only the

12. It's much more efficient just to do bms_add_range() outside the loop in:

+ for (i = 0; i < rel->nparts; i++)
+ {
+ rel->part_rels[i] = build_partition_rel(root, rel,
+ rel->part_oids[i]);
+ rel->live_parts = bms_add_member(rel->live_parts, i);
+ }

13. In set_append_rel_size() the foreach(l, root->append_rel_list)
loop could be made to loop over RelOptInfo->live_parts instead which
would save having to skip over AppendRelInfos that don't belong to
this parent. You'd need to make live_parts more general purpose and
also use it to mark children of inheritance parents.

14. I think you can skip the following if both are NULL. You could
likely add more smarts for different join types, but I think you
should at least skip when both are NULL. Perhaps the loop could be
driven off of bms_intersec of the two Rel's live_parts?

+ if (child_rel1 == NULL)
+ child_rel1 = build_dummy_partition_rel(root, rel1, cnt_parts);
+ if (child_rel2 == NULL)
+ child_rel2 = build_dummy_partition_rel(root, rel2, cnt_parts);

15. The following is not required when append_rel_array was previously NULL.

+ MemSet(root->append_rel_array + root->simple_rel_array_size,
+ 0, sizeof(AppendRelInfo *) * num_added_parts);

16. I don't think scan_all_parts is worth the extra code. The cost of
doing bms_num_members is going to be pretty small in comparison to
building paths for and maybe doing a join search for all partitions.

+ num_added_parts = scan_all_parts ? rel->nparts :
+ bms_num_members(partindexes);

In any case, you've got a bug in prune_append_rel_partitions() where
you're setting scan_all_parts to true instead of returning when
contradictory == true.

17. lockmode be of type LOCKMODE, not int.

+ Oid childOID, int lockmode,

18. Comment contradicts the code.

+ /* Open rel if needed; we already have required locks */
+ if (childOID != parentOID)
+ {
+ childrel = heap_open(childOID, lockmode);

I think you should be using NoLock here.

19. Comment contradicts the code.

+ /* Close child relations, but keep locks */
+ if (childOID != parentOID)
+ {
+ Assert(childrel != NULL);
+ heap_close(childrel, lockmode);
+ }

20. I assume translated_vars can now be NIL due to
build_dummy_partition_rel() not setting it.

- if (var->varlevelsup == 0 && appinfo)
+ if (var->varlevelsup == 0 && appinfo && appinfo->translated_vars)

It might be worth writing a comment to explain that, otherwise it's
not quite clear why you're doing this.

21. Unrelated change;

Assert(relid > 0 && relid < root->simple_rel_array_size);
+

22. The following comment mentions that Oids are copied, but that does
not happen in the code.

+ /*
+ * For partitioned tables, we just allocate space for RelOptInfo's.
+ * pointers for all partitions and copy the partition OIDs from the
+ * relcache. Actual RelOptInfo is built for a partition only if it is
+ * not pruned.
+ */

The Oid copying already happened during get_relation_info().

23. Traditionally translated_vars populated with a sequence of Vars in
the same order to mean no translation. Here you're changing how that
works:

+ /* leaving translated_vars to NIL to mean no translation needed */

This seems to be about the extent of your documentation on this, which
is not good enough.

24. "each" -> "reach"? ... Actually, I don't understand the comment.
In a partitioned hierarchy, how can the one before the top-level
partitioned table not be a partitioned table?

/*
* Keep moving up until we each the parent rel that's not a
* partitioned table. The one before that one would be the root
* parent.
*/

25. "already"

+ * expand_inherited_rtentry alreay locked all partitions, so pass

26. Your changes to make_partitionedrel_pruneinfo() look a bit broken.
You're wrongly assuming live_parts is the same as present_parts. If a
CHECK constraint eliminated the partition then those will be present
in live_parts but won't be part of the Append/MergeAppend subplans.
You might be able to maintain some of this optimisation by checking
for dummy rels inside the loop, but you're going to need to put back
the code that sets present_parts.

+ present_parts = bms_copy(subpart->live_parts);

27. Comment contradicts the code:

+ Bitmapset *live_parts; /* unpruned parts; NULL if all are live */

in add_rel_partitions_to_query() you're doing:

+ if (scan_all_parts)
+ {
+ for (i = 0; i < rel->nparts; i++)
+ {
+ rel->part_rels[i] = build_partition_rel(root, rel,
+ rel->part_oids[i]);
+ rel->live_parts = bms_add_member(rel->live_parts, i);
+ }
+ }

so the NULL part seems untrue.

28. Missing comments:

+ TupleDesc tupdesc;
+ Oid reltype;

29. The comment for prune_append_rel_partitions claims it "Returns RT
indexes", but that's not the case, per:

-Relids
-prune_append_rel_partitions(RelOptInfo *rel)
+void
+prune_append_rel_partitions(PlannerInfo *root, RelOptInfo *rel)

0003:

30. 2nd test can be tested inside the first test to remove redundant
partition check.

- inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+ inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

/*
* Check that there's at least one descendant, else treat as no-child
* case. This could happen despite above has_subclass() check, if table
* once had a child but no longer does.
*/
- if (list_length(inhOIDs) < 2)
+ if (rte->relkind != RELKIND_PARTITIONED_TABLE && list_length(inhOIDs) < 2)
{

31. The following code is wrong:

+ /* Determine the correct lockmode to use. */
+ if (rootRTindex == root->parse->resultRelation)
+ lockmode = RowExclusiveLock;
+ else if (rootrc && RowMarkRequiresRowShareLock(rootrc->markType))
+ lockmode = RowShareLock;
+ else
+ lockmode = AccessShareLock;

rootRTIndex remains at 0 if there are no row marks and resultRelation
will be 0 for SELECT queries, this means you'll use a RowExclusiveLock
for a SELECT instead of an AccessShareLock.

Why not just check the lockmode of the parent and use that?

[1] https://commitfest.postgresql.org/19/1768/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2018-09-11 01:11:15 Re: speeding up planning with partitions
Previous Message Tomas Vondra 2018-09-10 22:12:36 Re: [HACKERS] PATCH: multivariate histograms and MCV lists