Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

From: Marco Slot <marco(dot)slot(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Önder Kalacı <onderkalaci(at)gmail(dot)com>, shiy(dot)fnst(at)fujitsu(dot)com, wangw(dot)fnst(at)fujitsu(dot)com, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Date: 2023-01-20 12:35:51
Message-ID: CAFMSG9E7_JgnaBj1f_jU4dEmOnL=K6UyBAPR7eZ0NukUcR2L9g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 9, 2023 at 11:37 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Anyway, to get back to the point at hand: if we do have a REPLICA IDENTITY
> FULL situation then we can make use of any unique index over a subset of
> the transmitted columns, and if there's more than one candidate index
> it's unlikely to matter which one we pick. Given your comment I guess
> we have to also compare the non-indexed columns, so we can't completely
> convert the FULL case to the straight index case. But still it doesn't
> seem to me to be appropriate to use the planner to find a suitable index.

The main purpose of REPLICA IDENTITY FULL seems to be to enable logical
replication for tables that may have duplicates and therefore cannot have a
unique index that can be used as a replica identity.
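As a minimal illustration (table and column names are hypothetical, not from the patch), this is the kind of table that forces FULL:

```sql
-- A table that may contain duplicate rows, so no unique index can
-- serve as the replica identity.
CREATE TABLE events (user_id int, payload text);

-- FULL is then the only replica identity that allows UPDATE/DELETE
-- on this table to be replicated.
ALTER TABLE events REPLICA IDENTITY FULL;
```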

For those tables the user currently needs to choose between update/delete
erroring (bad) or doing a sequential scan on the apply side per
updated/deleted tuple (often worse). This issue currently prevents a lot of
automation around logical replication, because users need to decide whether
and when they are willing to accept partial downtime. The current REPLICA
IDENTITY FULL implementation can work in some cases, but applying the
effects of an update that affected a million rows through a million
sequential scans will certainly not end well.

This patch solves the problem by allowing the apply side to pick a
non-unique index to find a matching tuple instead of always using a
sequential scan. But that requires either some planning/costing logic to
avoid picking a lousy index, or letting the user manually preselect the
index to use, which is less convenient.
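For instance (again a hypothetical schema), a plain non-unique index would let the apply worker locate candidate tuples cheaply, with the remaining transmitted columns checked afterwards:

```sql
-- A non-unique index on a reasonably selective column: the apply
-- worker can scan it to find candidate tuples, then compare the
-- non-indexed columns to pick the actual match.
CREATE INDEX events_user_id_idx ON events (user_id);
```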

An alternative might be to construct prepared statements and use the
regular planner. If applied uniformly, that would also be nice from an
extensibility point of view, since there is currently no way for an
extension to augment the apply side. However, I assume the current approach
of using low-level functions in the common case was chosen for performance
reasons.
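A sketch of what that could look like (my assumption about the shape, not anything in the patch): the apply worker would prepare an UPDATE keyed on all transmitted columns and let the planner pick the access path:

```sql
-- Hypothetical prepared statement for applying an update under
-- REPLICA IDENTITY FULL: $1/$2 are the old column values, $3/$4 the
-- new ones. IS NOT DISTINCT FROM handles NULLs in non-key columns.
PREPARE apply_update(int, text, int, text) AS
  UPDATE events SET user_id = $3, payload = $4
  WHERE user_id = $1 AND payload IS NOT DISTINCT FROM $2;
```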

I suppose the options are:
1. use regular planner uniformly
2. use regular planner only when there's no replica identity (or
configurable?)
3. only use low-level functions
4. keep using sequential scans for every single updated row
5. introduce a hidden logical row identifier in the heap that is guaranteed
unique within a table and can be used as a replica identity when no unique
index exists

Any thoughts?

cheers,
Marco
