Re: Parallel INSERT (INTO ... SELECT ...)

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Greg Nancarrow <gregn4422(at)gmail(dot)com>
Cc: Amit Langote <amitlangote09(at)gmail(dot)com>, "Hou, Zhijie" <houzj(dot)fnst(at)cn(dot)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, vignesh C <vignesh21(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Tsunakawa, Takayuki" <tsunakawa(dot)takay(at)fujitsu(dot)com>, "Tang, Haiying" <tanghy(dot)fnst(at)cn(dot)fujitsu(dot)com>
Subject: Re: Parallel INSERT (INTO ... SELECT ...)
Date: 2021-03-02 12:19:19
Message-ID: CAA4eK1LjE1riu6Ki0nFT7QT8ew1OeO4dr2POp+srp+TOTVuwow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 1, 2021 at 9:08 AM Greg Nancarrow <gregn4422(at)gmail(dot)com> wrote:
>
> Posting an updated set of patches that includes Amit Langote's patch
> to the partition tracking scheme...
>

Few comments:
==============
1.
"Prior to entering parallel-mode for execution of INSERT with parallel SELECT,
a TransactionId is acquired and assigned to the current transaction state which
is then serialized in the parallel DSM for the parallel workers to use."

This point in the commit message seems a bit misleading to me because
IIUC we need to use transaction id only in the master backend for the
purpose of this patch. And, we are doing this at an early stage
because we don't allow to allocate it in parallel-mode. I think here
we should mention in some way that this has a disadvantage that if the
underneath select doesn't return any row then this transaction id
allocation would not be of use. However, that shouldn't happen in
practice in many cases. But, if I am missing something and this is
required by parallel workers then we can ignore what I said.

2.
/*
+ * Prepare for entering parallel mode by assigning a
+ * FullTransactionId, to be included in the transaction state that is
+ * serialized in the parallel DSM.
+ */
+ (void) GetCurrentTransactionId();
+ }

Similar to the previous comment this also seems to indicate that we
require TransactionId for workers. If that is not the case then this
comment also needs to be modified.

3.
static int common_prefix_cmp(const void *a, const void *b);

-
/*****************************************************************************

Spurious line removal.

4.
* Assess whether it's feasible to use parallel mode for this query. We
* can't do this in a standalone backend, or if the command will try to
- * modify any data, or if this is a cursor operation, or if GUCs are set
- * to values that don't permit parallelism, or if parallel-unsafe
- * functions are present in the query tree.
+ * modify any data using a CTE, or if this is a cursor operation, or if
+ * GUCs are set to values that don't permit parallelism, or if
+ * parallel-unsafe functions are present in the query tree.

This comment change is not required because this is quite similar to
what we do for CTAS. Your further comment changes in this context are
sufficient.

5.
+ (IsModifySupportedInParallelMode(parse->commandType) &&
+ is_parallel_possible_for_modify(parse))) &&

I think it would be better if we move the check
IsModifySupportedInParallelMode inside
is_parallel_possible_for_modify. Also, it might be better to name this
function as is_parallel_allowed_for_modify.

6.
@@ -260,6 +265,21 @@ set_plan_references(PlannerInfo *root, Plan *plan)
*/
add_rtes_to_flat_rtable(root, false);

+ /*
+ * If modifying a partitioned table, add its parallel-safety-checked
+ * partitions too to glob->relationOids, to register them as plan
+ * dependencies. This is only really needed in the case of a parallel
+ * plan, so that if parallel-unsafe properties are subsequently defined
+ * on the partitions, the cached parallel plan will be invalidated and
+ * a non-parallel plan will be generated.
+ */
+ if (IsModifySupportedInParallelMode(root->parse->commandType))
+ {
+ if (glob->partitionOids != NIL && glob->parallelModeNeeded)
+ glob->relationOids =
+ list_concat(glob->relationOids, glob->partitionOids);
+ }
+

Isn't it possible to add these partitionOids in set_plan_refs with the
T_Gather(Merge) node handling? That looks like a more natural place to
me, if that is feasible then we don't need parallelModeNeeded check
and maybe we don't need to even check IsModifySupportedInParallelMode
but I am less sure of the second check requirement.

7.
+#include "access/table.h"
+#include "access/xact.h"
#include "access/transam.h"
+#include "catalog/pg_class.h"
#include "catalog/pg_type.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
@@ -24,6 +27,8 @@
#include "optimizer/planmain.h"
#include "optimizer/planner.h"
#include "optimizer/tlist.h"
+#include "parser/parsetree.h"
+#include "partitioning/partdesc.h"

I think apart from xact.h, we don't need new additional includes.

8. I see that in function target_rel_max_parallel_hazard, we don't
release the lock on the target table after checking parallel-safety
but then in function target_rel_max_parallel_hazard_recurse, we do
release the lock on partition tables after checking their
parallel-safety. Can we add some comments to explain these two cases?

9. I noticed that the tests added for the first patch in
v18-0002-Parallel-SELECT-for-INSERT-INTO-.-SELECT-tests-and-doc take
even more time than select_parallel. I think we should see if we can
reduce the timing of this test. I haven't studied it in detail but
maybe some Inserts execution can be avoided. In some cases like below
just checking the plan should be sufficient. I think you can try to
investigate and see how much we can reduce it without affecting on
code-coverage of newly added code.

+--
+-- Parallel unsafe column default, should not use a parallel plan
+--
+explain (costs off) insert into testdef(a,c,d) select a,a*4,a*8 from test_data;
+ QUERY PLAN
+-----------------------------
+ Insert on testdef
+ -> Seq Scan on test_data
+(2 rows)
+
+insert into testdef(a,c,d) select a,a*4,a*8 from test_data;
+select * from testdef order by a;
+ a | b | c | d
+----+---+----+----
+ 1 | 5 | 4 | 8
+ 2 | 5 | 8 | 16
+ 3 | 5 | 12 | 24
+ 4 | 5 | 16 | 32
+ 5 | 5 | 20 | 40
+ 6 | 5 | 24 | 48
+ 7 | 5 | 28 | 56
+ 8 | 5 | 32 | 64
+ 9 | 5 | 36 | 72
+ 10 | 5 | 40 | 80
+(10 rows)
+

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2021-03-02 12:22:23 Re: [Patch] ALTER SYSTEM READ ONLY
Previous Message Magnus Hagander 2021-03-02 12:12:07 Re: We should stop telling users to "vacuum that database in single-user mode"