Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, "shiy(dot)fnst(at)fujitsu(dot)com" <shiy(dot)fnst(at)fujitsu(dot)com>, "wangw(dot)fnst(at)fujitsu(dot)com" <wangw(dot)fnst(at)fujitsu(dot)com>, Marco Slot <marco(dot)slot(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date: 2023-03-06 11:18:29
Message-ID: CACawEhX_n8JcmKty0hFeStbTa5OFpn40sPM=4zontkN0tSgOZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Peter, all

>
> 1.
> A published table must have a replica identity configured in order to
> be able to replicate UPDATE and DELETE operations, so that appropriate
> rows to update or delete can be identified on the subscriber side. By
> default, this is the primary key, if there is one. Another unique
> index (with certain additional requirements) can also be set to be the
> replica identity. When replica identity FULL is specified, indexes can
> be used on the subscriber side for searching the rows. These indexes
> should be btree, non-partial and have at least one column reference
> (e.g., should not consist of only expressions). These restrictions on
> the non-unique index properties are in essence the same restrictions
> that are enforced for primary keys. Internally, we follow the same
> approach for supporting index scans within logical replication scope.
> If there are no such suitable indexes, the search on the subscriber
> side can be very inefficient, therefore replica identity FULL should
> only be used as a fallback if no other solution is possible. If a
> replica identity other than full is set on the publisher side, a
> replica identity comprising the same or fewer columns must also be set
> on the subscriber side. See REPLICA IDENTITY for details on how to set
> the replica identity. If a table without a replica identity is added
> to a publication that replicates UPDATE or DELETE operations then
> subsequent UPDATE or DELETE operations will cause an error on the
> publisher. INSERT operations can proceed regardless of any replica
> identity.
>
> ~
>
> 1a.
> Changes include:
> "should" --> "must"
> "e.g." --> "i.e."
>
>
makes sense

> BEFORE
> These indexes should be btree, non-partial and have at least one
> column reference (e.g., should not consist of only expressions).
>
> SUGGESTION
> Candidate indexes must be btree, non-partial, and have at least one
> column reference (i.e., cannot consist of only expressions).
>
> ~
>
> 1b.
> The fix for my v27 review comment #2b (changing "full" to FULL) was
> not made correctly. It should be uppercase FULL, not full:
> "other than full" --> "other than FULL"
>

Right, changed that.

>
> ======
> src/backend/executor/execReplication.c
>
> 2.
> /*
> * Setup a ScanKey for a search in the relation 'rel' for a tuple 'key'
> that
> * is setup to match 'rel' (*NOT* idxrel!).
> *
> - * Returns whether any column contains NULLs.
> + * Returns how many columns should be used for the index scan.
> + *
> + * This is not generic routine, it expects the idxrel to be
> + * a btree, non-partial and have at least one column
> + * reference (e.g., should not consist of only expressions).
> *
> - * This is not generic routine, it expects the idxrel to be replication
> - * identity of a rel and meet all limitations associated with that.
> + * By definition, replication identity of a rel meets all
> + * limitations associated with that. Note that any other
> + * index could also meet these limitations.
> */
> -static bool
> +static int
> build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
> TupleTableSlot *searchslot)
>
> ~
>
> "(e.g., should not consist of only expressions)" --> "(i.e., cannot
> consist of only expressions)"
>
>
fixed

> ======
> src/backend/replication/logical/relation.c
>
> 3. FindUsableIndexForReplicaIdentityFull
>
> +/*
> + * Returns the oid of an index that can be used via the apply
> + * worker. The index should be btree, non-partial and have at
> + * least one column reference (e.g., should not consist of
> + * only expressions). The limitations arise from
> + * RelationFindReplTupleByIndex(), which is designed to handle
> + * PK/RI and these limitations are inherent to PK/RI.
>
> The 2nd sentence of this comment should match the same changes in the
> Commit message --- "must not" instead of "should not", "i.e." instead
> of "e.g.", etc. See the review comment #1a above.
>
>
Isn't "cannot" better than "must not" ? You also seem to suggest "cannot"
just above.

I changed it to "cannot" in all places.

> ~~~
>
> 4. IdxIsRelationIdentityOrPK
>
> +/*
> + * Given a relation and OID of an index, returns true if the
> + * index is relation's replica identity index or relation's
> + * primary key's index.
> + *
> + * Returns false otherwise.
> + */
> +bool
> +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid)
> +{
> + Assert(OidIsValid(idxoid));
> +
> + return GetRelationIdentityOrPK(rel) == idxoid;
> +}
>
> I think you've "simplified" this function in v28 but AFAICT now it has
> a different logic to v27.
>
> PREVIOUSLY it was coded like
> + return RelationGetReplicaIndex(rel) == idxoid ||
> + RelationGetPrimaryKeyIndex(rel) == idxoid;
>
> But now in that scenario, it won't
> even check the PK if there was a valid RI. So it might return false
> when previously it returned true. Is it deliberate?
>
>
Thanks for detailed review/investigation on this. But, I also agree that
there is no difference in terms of correctness. Also, it is probably better
to be consistent with the existing code. So,
making IdxIsRelationIdentityOrPK()
relying on GetRelationIdentityOrPK() still sounds better to me.

You can see if 'idxoid' did NOT match RI but if it DID match PK
> previously it would return true.

Still, I cannot see how this check would yield a different result with how
RI/PK works -- as Amit also noted in the next e-mail.

Do you see any cases where this check would produce a different result?
I cannot, but wanted to double check with you.

> ======
> .../subscription/t/032_subscribe_use_index.pl
>
> 5.
> +# Testcase start: SUBSCRIPTION USES INDEX WITH PUB/SUB different data
> +#
> +# The subscriber has duplicate tuples that publisher does not have.
> +# When publsher updates/deletes 1 row, subscriber uses indexes and
> +# exactly updates/deletes 1 row.
>
> "and exactly updates/deletes 1 row." --> "and updates/deletes exactly 1
> row."
>
>
> Fixed.

Given the changes are small, I'll incorporate the changes with v33 in my
next e-mail.

Thanks,
Onder KALACI

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Önder Kalacı 2023-03-06 11:18:39 Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Previous Message Önder Kalacı 2023-03-06 11:18:19 Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher