Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

From: Önder Kalacı <onderkalaci(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date: 2022-07-22 16:15:16
Message-ID: CACawEhWV=H_jDyJPm_CUqkBZNMkK-1gtt3ddShj_7p3eF3iwRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

>
> >
> > One another idea could be to re-calculate the index, say after N
> updates/deletes for the table. We may consider using subscription_parameter
> for getting N -- with a good default, or even hard-code into the code. I
> think the cost of re-calculating should really be pretty small compared to
> the other things happening during logical replication. So, a sane default
> might work?
> >
>
> One difficulty in deciding the value of N for the user or choosing a
> default would be that we need to probably consider the local DML
> operations on the table as well.
>
>
Fair enough, it is not easy to find a good default.

> > If you think the above doesn't work, I can try to work on a separate
> patch which adds something like "analyze invalidation callback".
> >
>
> I suggest we should give this a try and if this turns out to be
> problematic or complex then we can think of using some heuristic as
> you are suggesting above.
>

Alright, I'll try this and respond shortly back.

> >
> > OR use a complex heuristic with a reasonable index pick. And, the latter
> approach converged to the planner code in the patch. Do you think the
> former approach is acceptable?
> >
>
> In this regard, I was thinking in which cases a sequence scan can be
> better than the index scan (considering one is available). I think if
> a certain column has a lot of duplicates (for example, a column has a
> boolean value) then probably doing a sequence scan is better. Now,
> considering this even though your other approach sounds simpler but
> could lead to unpredictable results. So, I think the latter approach
> is preferable.
>

Yes, it makes sense. I also considered this during the development of the
patch, but forgot to mention :)

>
> BTW, do we want to consider partial indexes for the scan in this
> context? I mean it may not have data of all rows so how that would be
> usable?
>
>
As far as I can see, check_index_predicates() never picks a partial index
for the baserestrictinfos we create in CreateReplicaIdentityFullPaths().
The reason is that we have roughly the following call stack:

-check_index_predicates
--predicate_implied_by
---predicate_implied_by_recurse
----predicate_implied_by_simple_clause
-----operator_predicate_proof

And, inside operator_predicate_proof(), there is never going to be an
equality. Because, we push `Param`s to the baserestrictinfos whereas the
index predicates are always `Const`.

If we want to make it even more explicit, I can filter out `Path`s with
partial indexes. But that seems redundant to me. For now, I pushed the
commit with an assertion that we never pick partial indexes and also added
a test.

If you think it is better to explicitly filter out partial indexes, I can
do that as well.

> Few comments:
> ===============
> 1.
> static List *
> +CreateReplicaIdentityFullPaths(Relation localrel)
> {
> ...
> + /*
> + * Rather than doing all the pushups that would be needed to use
> + * set_baserel_size_estimates, just do a quick hack for rows and width.
> + */
> + rel->rows = rel->tuples;
>
> Is it a good idea to set rows without any selectivity estimation?
> Won't this always set the entire rows in a relation? Also, if we don't
> want to use set_baserel_size_estimates(), how will we compute
> baserestrictcost which will later be used in the costing of paths (for
> example, costing of seqscan path (cost_seqscan) uses it)?
>
> In general, I think it will be better to consider calling some
> top-level planner functions even for paths. Can we consider using
> make_one_rel() instead of building individual paths?

Thanks, this looks like a good suggestion/simplification. I wanted to use
the least amount of code possible, and make_one_rel() does either what I
exactly need or slightly more, which is great.

Note that make_one_rel() also follows the same call stack that I noted
above. So, I cannot spot any problems with partial indexes. Maybe am I
missing something here?

> On similar lines,
> in function PickCheapestIndexPathIfExists(), can we use
> set_cheapest()?
>
>
Yes, make_one_rel() + set_cheapest() sounds better. Changed.

> 2.
> @@ -57,9 +60,6 @@ build_replindex_scan_key(ScanKey skey, Relation rel,
> Relation idxrel,
> int2vector *indkey = &idxrel->rd_index->indkey;
> bool hasnulls = false;
>
> - Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel) ||
> - RelationGetPrimaryKeyIndex(rel) == RelationGetRelid(idxrel));
>
> You have removed this assertion but there is a comment ("This is not
> generic routine, it expects the idxrel to be replication identity of a
> rel and meet all limitations associated with that.") atop this
> function which either needs to be changed/removed and probably we
> should think if the function needs some change after removing that
> restriction.
>
>
Ack, I can see your point. I think, for example, we should skip index
attributes that are not simple column references. And, probably whatever
other restrictions that PRIMARY has, should be here.

I'll read some more Postgres code & test before pushing a revision for this
part. In the meantime, if you have any suggestions/pointers for me to look
into, please note here.

Attached v2 of the patch with addressing some of the comments you had. I'll
work on the remaining shortly.

Thanks,
Onder

Attachment Content-Type Size
v2_0001_use_index_on_subs_when_pub_rep_ident_full.patch application/octet-stream 42.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joe Conway 2022-07-22 16:49:10 Re: [PATCH] Introduce array_shuffle() and array_sample()
Previous Message Julien Rouhaud 2022-07-22 16:08:30 Re: Expose Parallelism counters planned/execute in pg_stat_statements