Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date: 2022-08-25 09:09:15
Message-ID: CACawEhXbw==K02v3=nHFEAFJqegx0b4r2J+FtXtKFkJeE6R95Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter, all

Thanks for the detailed review!

> 1. Commit message
>
> 1a.
> Majority of the logic on the subscriber side has already existed in the
> code.
>
> 1b.
> Second, when REPLICA IDENTITY IS FULL on the publisher and an index is
> used on the subscriber...
>
>
> 1c.
> Still, below I try to show case the potential improvements using an
> index on the subscriber
> `pgbench_accounts(bid)`. With the index, the replication catches up
> around ~5 seconds.
> When the index is dropped, the replication takes around ~300 seconds.
>
> "show case" -> "showcase"
>
> Applied your suggestions to 1a/1b/1c/

~
>
> 1d.
> In above text, what was meant by "catches up around ~5 seconds"?
> e.g. Did it mean *improves* by ~5 seconds, or *takes* ~5 seconds?
>
>
It "takes" 5 seconds to replicate all the changes. To be specific, I
execute 'SELECT sum(abalance) FROM pgbench_accounts' on the subscriber, and
try to measure the time until when all the changes are replicated. I do use
the same query on the publisher to check what the query result should be
when replication is done.

I updated the relevant text, does that look better?

> ~
>
> 1e.
> // create one indxe, even on a low cardinality column
>
> typo "indxe"
>
> ======
>

fixed.

Also, I realized that some of the comments on the commit message are stale,
updated those as well.

>
> 2. GENERAL
>
> 2a.
> There are lots of single-line comments that start lowercase, but by
> convention, I think they should start uppercase.
>
> e.g. + /* we should always use at least one attribute for the index scan */
> e.g. + /* we might not need this if the index is unique */
> e.g. + /* avoid expensive equality check if index is unique */
> e.g. + /* unrelated Path, skip */
> e.g. + /* simple case, we already have an identity or pkey */
> e.g. + /* indexscans are disabled, use seq. scan */
> e.g. + /* target is a regular table */
>
> ~~
>

Thanks for noting this, I didn't realize that there is a strict requirement
on this. Updated all of your suggestions, and realized one more such case.

Is there documentation where such conventions are listed? I couldn't
find any.

>
> 2b.
> There are some excess blank lines between the function. By convention,
> I think 1 blank line is normal, but here there are sometimes 2.
>
> ~~
>

Updated as well.

>
> 2c.
> There are some new function comments which include their function name
> in the comment. It seemed unnecessary.
>
> e.g. GetCheapestReplicaIdentityFullPath
> e.g. FindUsableIndexForReplicaIdentityFull
> e.g. LogicalRepUsableIndex
>
> ======
>

Fixed this as well.

>
> 3. src/backend/executor/execReplication.c - build_replindex_scan_key
>
> - int attoff;
> + int index_attoff;
> + int scankey_attoff;
> bool isnull;
> Datum indclassDatum;
> oidvector *opclass;
> int2vector *indkey = &idxrel->rd_index->indkey;
> - bool hasnulls = false;
> -
> - Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
> - RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));
>
> indclassDatum = SysCacheGetAttr(INDEXRELID, idxrel->rd_indextuple,
> Anum_pg_index_indclass, &isnull);
> Assert(!isnull);
> opclass = (oidvector *) DatumGetPointer(indclassDatum);
> + scankey_attoff = 0;
>
> Maybe just assign scankey_attoff = 0 at the declaration?
>
>
Again, lack of coding convention knowledge :/ My observation is that it is
often not assigned during the declaration. But, changed this one.

> ~~~
>
> 4.
>
> + /*
> + * There are two cases to consider. First, if the index is a primary or
> + * unique key, we cannot have any indexes with expressions. So, at this
> + * point we are sure that the index we deal is not these.
> + */
>
> "we deal" -> "we are dealing with" ?
>
> makes sense

> ~~~
>
> 5.
>
> + /*
> + * For a non-primary/unique index with an additional expression, do
> + * not have to continue at this point. However, the below code
> + * assumes the index scan is only done for simple column references.
> + */
> + continue;
>
> Is this one of those comments that ought to have a "XXX" prefix as a
> note for the future?
>

Makes sense

>
> ~~~
>
> 6.
>
> - int pkattno = attoff + 1;
> ...
> /* Initialize the scankey. */
> - ScanKeyInit(&skey[attoff],
> - pkattno,
> + ScanKeyInit(&skey[scankey_attoff],
> + index_attoff + 1,
> BTEqualStrategyNumber,
> Wondering if it would have been simpler if you just did:
> int pkattno = index_attoff + 1;
>

The index is not necessarily the primary key at this point, that's why
I removed it.

There are already 3 variables in the same function
index_attoff, scankey_attoff and table_attno, which are hard to avoid. But,
this one seemed ok to avoid, mostly to simplify the readability. Do you
think it is better with the additional variable? Still, I think we need a
better name as "pk" is not relevant anymore.

~~~
>
> 7.
>
> - skey[attoff].sk_flags |= SK_ISNULL;
> + skey[scankey_attoff].sk_flags |= SK_ISNULL;
> + skey[scankey_attoff].sk_flags |= SK_SEARCHNULL;
>
> SUGGESTION
> skey[scankey_attoff].sk_flags |= (SK_ISNULL | SK_SEARCHNULL)
>
>
looks good, changed

> ~~~
>
> 8. src/backend/executor/execReplication.c - RelationFindReplTupleByIndex
>
> @@ -128,28 +171,44 @@ RelationFindReplTupleByIndex(Relation rel, Oid
> idxoid,
> TransactionId xwait;
> Relation idxrel;
> bool found;
> + TypeCacheEntry **eq;
> + bool indisunique;
> + int scankey_attoff;
>
> /* Open the index. */
> idxrel = index_open(idxoid, RowExclusiveLock);
> + indisunique = idxrel->rd_index->indisunique;
> +
> + /* we might not need this if the index is unique */
> + eq = NULL;
>
> Maybe just default assign eq = NULL in the declaration?
>
>
Again, I wasn't sure if it is OK regarding the coding convention to assign
during the declaration. Changed now.

> ~~~
>
> 9.
>
> + scan = index_beginscan(rel, idxrel, &snap,
> + scankey_attoff, 0);
>
> Unnecessary wrapping?
>
>
Seems so, changed

> ~~~
>
> 10.
>
> + /* we only need to allocate once */
> + if (eq == NULL)
> + eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts);
>
> But shouldn't you also free this 'eq' before the function returns, to
> prevent leaking memory?
>
>
Two notes here. First, this is allocated inside ApplyMessageContext, which
seems to be reset per tuple change. So, that seems like a good boundary to
keep this allocation in memory.

Second, RelationFindReplTupleSeq() doesn't free the same allocation roughly
at a very similar call stack. That's why I decided not to pfree. Do you see
strong reason to pfree at this point? Then we should probably change that
for RelationFindReplTupleSeq() as well.

> ======
>
> 11. src/backend/replication/logical/relation.c - logicalrep_rel_open
>
> + /*
> + * Finding a usable index is an infrequent operation, it is performed
> + * only when first time an operation is performed on the relation or
> + * after invalidation of the relation cache entry (e.g., such as ANALYZE).
> + */
>
> SUGGESTION (minor rewording)
> Finding a usable index is an infrequent task. It is performed only
> when an operation is first performed on the relation, or after
> invalidation of the relation cache entry (e.g., such as ANALYZE).
>
> ~~~
>
> makes sense, applied

> 12. src/backend/replication/logical/relation.c - logicalrep_partition_open
>
> Same as comment #11 above.
>
>
done

> ~~~
>
> 13. src/backend/replication/logical/relation.c - GetIndexOidFromPath
>
> +static
> +Oid
> +GetIndexOidFromPath(Path *path)
>
> Typically I think 'static Oid' should be on one line.
>

done

> ~~~
>
> 14.
>
> + switch (path->pathtype)
> + {
> + case T_IndexScan:
> + case T_IndexOnlyScan:
> + {
> + IndexPath *index_sc = (IndexPath *) path;
> + indexOid = index_sc->indexinfo->indexoid;
> +
> + break;
> + }
> +
> + default:
> + indexOid = InvalidOid;
> + }
>
> Is there any point in using a switch statement when there is only one
> functional code block?
>
> Why not just do:
>
> if (path->pathtype == T_IndexScan || path->pathtype == T_IndexOnlyScan)
> {
> ...
> }
>
> return InvalidOid;
>
> ~~~
>

Good point, in the first iterations of the patch, we also had Bitmap scans
here. Now, the switch is redundant, applied your suggestion.

>
> 15. src/backend/replication/logical/relation.c - IndexOnlyOnExpression
>
> + * Returns true if the given index consist only of expressions such as:
> + * CREATE INDEX idx ON table(foo(col));
>
> "consist" -> "consists"
>
> ~~~
>

fixed

>
> 16.
>
> +IndexOnlyOnExpression(IndexInfo *indexInfo)
> +{
> + int i=0;
> + for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)
>
> Don't initialise 'i' twice.
>
> ~~~
>

fixed

>
> 17.
>
> + AttrNumber attnum = indexInfo->ii_IndexAttrNumbers[i];
> + if (AttributeNumberIsValid(attnum))
> + return false;
> +
> + }
>
> Spurious blank line
>
> ~~~
>

fixed

>
> 18. src/backend/replication/logical/relation.c -
> GetCheapestReplicaIdentityFullPath
>
> +/*
> + * Iterates over the input path list, and returns another path list
> + * where paths with non-btree indexes, partial indexes or
> + * indexes on only expressions are eliminated from the list.
> + */
>
> "path list, and" -> "path list and"
>
> ~~~
>

fixed

>
> 19.
>
> + if (!OidIsValid(indexOid))
> + {
> + /* unrelated Path, skip */
> + suitableIndexList = lappend(suitableIndexList, path);
> + continue;
> + }
> +
> + indexRelation = index_open(indexOid, AccessShareLock);
> + indexInfo = BuildIndexInfo(indexRelation);
> + is_btree_index = (indexInfo->ii_Am == BTREE_AM_OID);
> + is_partial_index = (indexInfo->ii_Predicate != NIL);
> + is_index_only_on_expression = IndexOnlyOnExpression(indexInfo);
> + index_close(indexRelation, NoLock);
> +
> + if (!is_btree_index || is_partial_index || is_index_only_on_expression)
> + continue;
>
> Maybe better to change this logic using if/else and changing the last
> condition so them you can avoid having any of those 'continue' in this
> loop.
>

Yes, it makes sense. It is good to avoid `continue` in the loop.

>
> ~~~
>
> 20. src/backend/replication/logical/relation.c -
> GetCheapestReplicaIdentityFullPath
>
> +/*
> + * GetCheapestReplicaIdentityFullPath generates all the possible paths
> + * for the given subscriber relation, assuming that the source relation
> + * is replicated via REPLICA IDENTITY FULL.
> + *
> + * The function assumes that all the columns will be provided during
> + * the execution phase, given that REPLICA IDENTITY FULL gurantees
> + * that.
> + */
>
> 20a.
> typo "gurantees"
>
> ~
>

Fixed, for future patches I'll do a more thorough review on these myself.
Sorry for all these typos & convention errors!

> 20b.
> The function comment neglects to say that after getting all these
> paths the final function return is the cheapest one that it found.
>
> ~~~
>

Improved the comment a bit

>
> 21.
>
> + for (attno = 0; attno < RelationGetNumberOfAttributes(localrel); attno++)
> + {
> + Form_pg_attribute attr = TupleDescAttr(localrel->rd_att, attno);
> +
> + if (attr->attisdropped)
> + {
> + continue;
> + }
> + else
> + {
> + Expr *eq_op;
>
> Maybe simplify by just removing the 'else' or instead just reverse the
> condition of the 'if'.
>
> ~~~
>

I like the second suggestion more, as the `!attr->attisdropped` code block
has local declarations, so keeping them local to that block seems easier
to follow.

>
> 22.
>
> + /*
> + * A sequential scan has could have been dominated by
> + * by an index scan during make_one_rel(). We should always
> + * have a sequential scan before set_cheapest().
> + */
>
> "has could have been" -> "could have been"
>
> ~~~
>

An interesting grammar I had :) Fixed

>
> 23. src/backend/replication/logical/relation.c - LogicalRepUsableIndex
>
> +static Oid
> +LogicalRepUsableIndex(Relation localrel, LogicalRepRelation *remoterel)
> +{
> + Oid idxoid;
> +
> + /*
> + * We never need index oid for partitioned tables, always rely on leaf
> + * partition's index.
> + */
> + if (localrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> + return InvalidOid;
> +
> + /* simple case, we already have an identity or pkey */
> + idxoid = GetRelationIdentityOrPK(localrel);
> + if (OidIsValid(idxoid))
> + return idxoid;
> +
> + /* indexscans are disabled, use seq. scan */
> + if (!enable_indexscan)
> + return InvalidOid;
>
> I thought the (!enable_indexscan) fast exit perhaps should be done
> first, or at least before calling GetRelationIdentityOrPK.
>

This is actually a point where I need some more feedback. On HEAD, even if
the index scan is disabled, we use the index. For this one, (a) I didn't
want to change the behavior for existing users (b) want to have a way to
disable this feature, and enable_indexscan seems like a good one.

Do you think I should dare to move it above GetRelationIdentityOrPK()? Or,
maybe I just need more comments? I improved the comment, and it would be
nice to hear your thoughts on this.

> ======
>
> 24. src/backend/replication/logical/worker.c - apply_handle_delete_internal
>
> @@ -2034,12 +2021,14 @@ apply_handle_delete_internal(ApplyExecutionData
> *edata,
> EPQState epqstate;
> TupleTableSlot *localslot;
> bool found;
> + Oid usableIndexOid = usable_indexoid_internal(edata, relinfo);
>
> EvalPlanQualInit(&epqstate, estate, NULL, NIL, -1);
> ExecOpenIndices(relinfo, false);
>
> - found = FindReplTupleInLocalRel(estate, localrel, remoterel,
> - remoteslot, &localslot);
> +
> + found = FindReplTupleInLocalRel(estate, localrel, usableIndexOid,
> + remoterel, remoteslot, &localslot);
>
> 24a.
> Excess blank line above FindReplTupleInLocalRel call.
>
> Fixed

> ~
>
> 24b.
> This code is almost same in function handle_update_internal(), except
> the wrapping of the params is different. Better to keep everything
> consistent looking.
>
>
Hmm, I have not changed how they look because they have one variable
difference (&relmapentry->remoterel vs remoterel), which requires the
indentation to be slightly difference. So, I either need a new variable or
keep them as-is?

> ~~~
>
> 25. src/backend/replication/logical/worker.c - usable_indexoid_internal
>
> +/*
> + * Decide whether we can pick an index for the relinfo (e.g., the
> relation)
> + * we're actually deleting/updating from. If it is a child partition of
> + * edata->targetRelInfo, find the index on the partition.
> + */
> +static Oid
> +usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo
> *relinfo)
>
> I'm not sure is this can maybe return InvalidOid? The function comment
> should clarify it.
>
>
Improved the comment

> ~~~
>
> 26.
>
> I might be mistaken, but somehow I feel this function can be
> simplified. e.g. If you have a var 'relmapentry' and let the normal
> table use the initial value of that. Then I think you only need to
> test for the partitioned table and reassign that var as appropriate.
> It also removes the need for having 'usableIndexOid' var.
>
> FOR EXAMPLE,
>
> static Oid
> usable_indexoid_internal(ApplyExecutionData *edata, ResultRelInfo *relinfo)
> {
> ResultRelInfo *targetResultRelInfo = edata->targetRelInfo;
> LogicalRepRelMapEntry *relmapentry = edata->targetRel;
> Oid targetrelid = targetResultRelInfo->ri_RelationDesc->rd_rel->oid;
> Oid localrelid = relinfo->ri_RelationDesc->rd_id;
>
> if (targetrelid != localrelid)
> {
> /*
> * Target is a partitioned table, so find relmapentry of the partition.
> */
> TupleConversionMap *map = relinfo->ri_RootToPartitionMap;
> AttrMap *attrmap = map ? map->attrMap : NULL;
> LogicalRepPartMapEntry *part_entry =
> logicalrep_partition_open(relmapentry, relinfo->ri_RelationDesc,
> attrmap);
>
> Assert(targetResultRelInfo->ri_RelationDesc->rd_rel->relkind ==
> RELKIND_PARTITIONED_TABLE);
>
> relmapentry = part_entry->relmapentry;
> }
> return relmapentry->usableIndexOid;
> }
>
> ~~~
>

True, that simplifies the function, applied.

>
> 27.
>
> + /*
> + * Target is a partitioned table, get the index oid the partition.
> + */
>
> SUGGESTION
> Target is a partitioned table, so get the index oid of the partition.
>
> or (see the example of comment @26)
>
>
Applied

> ~~~
>
> 28. src/backend/replication/logical/worker.c - FindReplTupleInLocalRel
>
> @@ -2093,12 +2125,11 @@ FindReplTupleInLocalRel(EState *estate,
> Relation localrel,
>
> *localslot = table_slot_create(localrel, &estate->es_tupleTable);
>
> I think this might have been existing functionality...
>
> The comment says "* Local tuple, if found, is returned in
> '*localslot'." But the code is unconditionally doing
> table_slot_create() before it even knows if a tuple was found or not.
> So what about when it is NOT found - in that case shouldn't there be
> some cleaning up that (unused?) table slot that got unconditionally
> created?
>
>
This sounds accurate. But I guess it may not have been considered critical
as we are operating in the ApplyMessageContext? Tha is going to be freed
once a single tuple is dispatched.

I have a slight preference not to do it in this patch, but if you think
otherwise let me know.

> ~~~
>
> 29. src/backend/replication/logical/worker.c - apply_handle_tuple_routing
>
> @@ -2202,13 +2233,17 @@ apply_handle_tuple_routing(ApplyExecutionData
> *edata,
> * suitable partition.
> */
> {
> + LogicalRepRelMapEntry *entry;
> TupleTableSlot *localslot;
> ResultRelInfo *partrelinfo_new;
> bool found;
>
> + entry = &part_entry->relmapentry;
>
> Maybe just do this assignment at the entry declaration time?
>
>
done

> ~~~
>
> 30.
>
> /* Get the matching local tuple from the partition. */
> found = FindReplTupleInLocalRel(estate, partrel,
> - &part_entry->remoterel,
> + part_entry->relmapentry.usableIndexOid,
> + &entry->remoterel,
> remoteslot_part, &localslot);
> Why not use the new 'entry' var just assigned instead of repeating
> part_entry->relmapentry?
>
> SUGGESTION
> found = FindReplTupleInLocalRel(estate, partrel,
> entry->usableIndexOid,
> &entry->remoterel,
> remoteslot_part, &localslot);
>
> ~~~
>
> Yes, looks better, changed

> 31.
>
> + slot_modify_data(remoteslot_part, localslot, entry,
> newtup);
>
> Unnecessary wrapping.
>
> ======
>

I think I have not changed this, but fixed anyway

>
> 32. src/include/replication/logicalrelation.h
>
> +typedef struct LogicalRepPartMapEntry
> +{
> + Oid partoid; /* LogicalRepPartMap's key */
> + LogicalRepRelMapEntry relmapentry;
> +} LogicalRepPartMapEntry;
>
> IIUC this struct has been moved from relation.c to here. But I think
> there was a large comment about this struct which maybe needs to be
> moved with it (see the original relation.c).
>
> /*
> * Partition map (LogicalRepPartMap)
> *
> * When a partitioned table is used as replication target, replicated
> * operations are actually performed on its leaf partitions, which requires
> * the partitions to also be mapped to the remote relation. Parent's entry
> * (LogicalRepRelMapEntry) cannot be used as-is for all partitions, because
> * individual partitions may have different attribute numbers, which means
> * attribute mappings to remote relation's attributes must be maintained
> * separately for each partition.
> */
>
> ======
>
Oh, seems so, moved.

>
> 33. .../subscription/t/032_subscribe_use_index.pl
>
> Typo "MULTIPILE"
>
> This typo occurs several times...
>
> e.g. # Testcase start: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS
> e.g. # Testcase end: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS
> e.g. # Testcase start: SUBSCRIPTION USES INDEX WITH MULTIPILE COLUMNS
> e.g. # Testcase end: SUBSCRIPTION USES INDEX MODIFIES MULTIPILE ROWS
>
> ~~~
>
>
Yep :/ Fixed now

> 34.
>
> # Basic test where the subscriber uses index
> # and only touches multiple rows
>
> What does "only ... multiple" mean?
>
> This occurs several times also.
>
>
Ah, in the earlier iterations, the tests were updating/deleting 1 row.
Lately, I changed it to multiple rows, just to have more coverage. I guess
the discrepancy is because of that. Updated now.

> ~~~
>
> 35.
>
> +# wait for initial table synchronization to finish
> +$node_subscriber->wait_for_subscription_sync;
> +$node_subscriber->wait_for_subscription_sync;
> +$node_subscriber->wait_for_subscription_sync;
>
> That triple wait looks unusual. Is it deliberate?
>
> Ah, not really. Removed.

Thanks,
Onder

Attachment Content-Type Size
v9_0001_use_index_on_subs_when_pub_rep_ident_full.patch application/x-patch 72.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2022-08-25 09:19:05 Re: pg_receivewal and SIGTERM
Previous Message Alvaro Herrera 2022-08-25 08:58:16 Re: Letter case of "admin option"