RE: Parallel INSERT (INTO ... SELECT ...)

From: "tsunakawa(dot)takay(at)fujitsu(dot)com" <tsunakawa(dot)takay(at)fujitsu(dot)com>
To: 'Greg Nancarrow' <gregn4422(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Antonin Houska <ah(at)cybertec(dot)at>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Parallel INSERT (INTO ... SELECT ...)
Date: 2021-01-24 23:23:43
Message-ID: TYAPR01MB29903F0BEB8D41D287697612FEBE0@TYAPR01MB2990.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello Greg-san,

Second group of comments (I'll reply to (1) - (4) later):

(5)
@@ -790,7 +790,8 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
...
- if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
+ if ((plannedstmt->commandType != CMD_SELECT &&
+ !IsModifySupportedInParallelMode(plannedstmt->commandType)) || plannedstmt->hasModifyingCTE)
PreventCommandIfParallelMode(CreateCommandName((Node *) plannedstmt));
}

Now that we're trying to allow parallel writes (INSERT), we should:

* use ExecCheckXactReadOnly() solely for checking read-only transactions, as the function name represents. That is, move the call to PreventCommandIfParallelMode() up to standard_ExecutorStart().

* Update the comment above the call to ExecCheckXactReadOnly().

(6)
@@ -764,6 +777,22 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
...
+ else
+ {
+ pei->processed_count = NULL;
+ }

The braces can be deleted.

(7)
@@ -1400,6 +1439,16 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);

+ Assert(queryDesc->operation == CMD_SELECT || IsModifySupportedInParallelMode(queryDesc->operation));
+ if (IsModifySupportedInParallelMode(queryDesc->operation))
+ {
+ /*
+ * Record that the CurrentCommandId is used, at the start of the
+ * parallel operation.
+ */
+ SetCurrentCommandIdUsedForWorker();
+ }
+
/* Setting debug_query_string for individual workers */
debug_query_string = queryDesc->sourceText;

@@ -765,12 +779,16 @@ GetCurrentCommandId(bool used)
if (used)
{
/*
- * Forbid setting currentCommandIdUsed in a parallel worker, because
- * we have no provision for communicating this back to the leader. We
- * could relax this restriction when currentCommandIdUsed was already
- * true at the start of the parallel operation.
+ * If in a parallel worker, only allow setting currentCommandIdUsed if
+ * currentCommandIdUsed was already true at the start of the parallel
+ * operation (by way of SetCurrentCommandIdUsed()), otherwise forbid
+ * setting currentCommandIdUsed because we have no provision for
+ * communicating this back to the leader. Once currentCommandIdUsed is
+ * set, the commandId used by leader and workers can't be changed,
+ * because CommandCounterIncrement() then prevents any attempted
+ * increment of the current commandId.
*/
- Assert(!IsParallelWorker());
+ Assert(!(IsParallelWorker() && !currentCommandIdUsed));
currentCommandIdUsed = true;
}
return currentCommandId;

What happens without these changes? If this kind of change is really necessary, it seems more natural to pass currentCommandIdUsed together with currentCommandId through SerializeTransactionState() and StartParallelWorkerTransaction(), instead of the above changes.

As an aside, SetCurrentCommandIdUsed() in the comment should be SetCurrentCommandIdUsedForWorker().

(8)
+ /*
+ * If the trigger type is RI_TRIGGER_FK, this indicates a FK exists in
+ * the relation, and this would result in creation of new CommandIds
+ * on insert/update/delete and this isn't supported in a parallel
+ * worker (but is safe in the parallel leader).
+ */
+ trigtype = RI_FKey_trigger_type(trigger->tgfoid);
+ if (trigtype == RI_TRIGGER_FK)
+ {
+ if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
+ return true;
+ }

Here, RI_TRIGGER_FK should instead be RI_TRIGGER_PK, because RI_TRIGGER_FK triggers do not generate command IDs. See RI_FKey_check() which is called in RI_TRIGGER_FK case. In there, ri_PerformCheck() is called with the detectNewRows argument set to false, which causes CommandCounterIncrement() to not be called.

Plus, tables that have RI_TRIGGER_PK should allow parallel INSERT in a parallel-safe manner, because those triggers only fire for UPDATE and DELETE. So, for the future parallel UPDATE/DELETE support, the above check should be performed in UPDATE and DELETE cases.

(In a data warehouse, fact tables, which store large amounts of historical data, typically have foreign keys to smaller dimension tables. Thus, it's important to allow parallel INSERTs on tables with foreign keys.)

Regards
Takayuki Tsunakawa

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiro Ikeda 2021-01-24 23:33:49 Re: About to add WAL write/fsync statistics to pg_stat_wal view
Previous Message Mark Rofail 2021-01-24 20:46:49 Re: [HACKERS] GSoC 2017: Foreign Key Arrays