Re: Minimal logical decoding on standbys

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: "Drouvot, Bertrand" <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Ibrar Ahmed <ibrar(dot)ahmad(at)gmail(dot)com>, Amit Khandekar <amitdkhan(dot)pg(at)gmail(dot)com>, fabriziomello(at)gmail(dot)com, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Rahila Syed <rahila(dot)syed(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minimal logical decoding on standbys
Date: 2023-01-23 23:21:21
Message-ID: 20230123232121.5zs7dd45f442ozwu@liskov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I'm new to this thread and subject, but I had a few basic thoughts about
the first patch in the set.

On Mon, Jan 23, 2023 at 12:03:35PM +0100, Drouvot, Bertrand wrote:
>
> Please find V42 attached.
>
> From 3c206bd77831d507f4f95e1942eb26855524571a Mon Sep 17 00:00:00 2001
> From: bdrouvotAWS <bdrouvot(at)amazon(dot)com>
> Date: Mon, 23 Jan 2023 10:07:51 +0000
> Subject: [PATCH v42 1/6] Add info in WAL records in preparation for logical
> slot conflict handling.
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Overall design:
>
> 1. We want to enable logical decoding on standbys, but replay of WAL
> from the primary might remove data that is needed by logical decoding,
> causing replication conflicts much as hot standby does.

It is a little confusing to mention replication conflicts in point 1. It
makes it sound like it already logs a recovery conflict. Without the
recovery conflict handling in this patchset, logical decoding of
statements using data that has been removed will fail with some error
like :
ERROR: could not map filenumber "xxx" to relation OID
Part of what this patchset does is introduce the concept of a new kind
of recovery conflict and a resolution process.

> 2. Our chosen strategy for dealing with this type of replication slot
> is to invalidate logical slots for which needed data has been removed.
>
> 3. To do this we need the latestRemovedXid for each change, just as we
> do for physical replication conflicts, but we also need to know
> whether any particular change was to data that logical replication
> might access.

It isn't clear from the above sentence why you would need both. I think
it has something to do with what is below (hot_standby_feedback being
off), but I'm not sure, so the order is confusing.

> 4. We can't rely on the standby's relcache entries for this purpose in
> any way, because the startup process can't access catalog contents.
>
> 5. Therefore every WAL record that potentially removes data from the
> index or heap must carry a flag indicating whether or not it is one
> that might be accessed during logical decoding.
>
> Why do we need this for logical decoding on standby?
>
> First, let's forget about logical decoding on standby and recall that
> on a primary database, any catalog rows that may be needed by a logical
> decoding replication slot are not removed.
>
> This is done thanks to the catalog_xmin associated with the logical
> replication slot.
>
> But, with logical decoding on standby, in the following cases:
>
> - hot_standby_feedback is off
> - hot_standby_feedback is on but there is no a physical slot between
> the primary and the standby. Then, hot_standby_feedback will work,
> but only while the connection is alive (for example a node restart
> would break it)
>
> Then, the primary may delete system catalog rows that could be needed
> by the logical decoding on the standby (as it does not know about the
> catalog_xmin on the standby).
>
> So, it’s mandatory to identify those rows and invalidate the slots
> that may need them if any. Identifying those rows is the purpose of
> this commit.

I would like a few more specifics about this commit (patch 1 in the set)
itself in the commit message.

I think it would be good to have the commit message mention what kinds
of operations require WAL to contain information about whether or not it
is operating on a catalog table and why this is.

For example, I think the explanation of the feature makes it clear why
vacuum and pruning operations would require isCatalogRel, however it
isn't immediately obvious why page reuse would.

Also, because the diff has so many function signatures changed, it is
hard to tell which xlog record types are actually being changed to
include isCatalogRel. It might be too detailed/repetitive for the final
commit message to have a list of the xlog types requiring this info
(gistxlogPageReuse, spgxlogVacuumRedirect, xl_hash_vacuum_one_page,
xl_btree_reuse_page, xl_btree_delete, xl_heap_visible, xl_heap_prune,
xl_heap_freeze_page) but perhaps you could enumerate all the general
operations (freeze page, vacuum, prune, etc).

> Implementation:
>
> When a WAL replay on standby indicates that a catalog table tuple is
> to be deleted by an xid that is greater than a logical slot's
> catalog_xmin, then that means the slot's catalog_xmin conflicts with
> the xid, and we need to handle the conflict. While subsequent commits
> will do the actual conflict handling, this commit adds a new field
> isCatalogRel in such WAL records (and a new bit set in the
> xl_heap_visible flags field), that is true for catalog tables, so as to
> arrange for conflict handling.

You do mention it a bit here, but I think it could be more clear and
specific.

> diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
> index ba394f08f6..235f1a1843 100644
> --- a/src/backend/access/gist/gist.c
> +++ b/src/backend/access/gist/gist.c
> @@ -348,7 +348,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
> for (; ptr; ptr = ptr->next)
> {
> /* Allocate new page */
> - ptr->buffer = gistNewBuffer(rel);
> + ptr->buffer = gistNewBuffer(heapRel, rel);
> GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
> ptr->page = BufferGetPage(ptr->buffer);
> ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
> diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
> index d21a308d41..a87890b965 100644
> --- a/src/backend/access/gist/gistbuild.c
> +++ b/src/backend/access/gist/gistbuild.c
> @@ -298,7 +298,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
> Page page;
>
> /* initialize the root page */
> - buffer = gistNewBuffer(index);
> + buffer = gistNewBuffer(heap, index);
> Assert(BufferGetBlockNumber(buffer) == GIST_ROOT_BLKNO);
> page = BufferGetPage(buffer);
>
> diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
> index 56451fede1..119e34ce0f 100644
> --- a/src/backend/access/gist/gistutil.c
> +++ b/src/backend/access/gist/gistutil.c
> @@ -821,7 +821,7 @@ gistcheckpage(Relation rel, Buffer buf)
> * Caller is responsible for initializing the page by calling GISTInitBuffer
> */
> Buffer
> -gistNewBuffer(Relation r)
> +gistNewBuffer(Relation heaprel, Relation r)
> {

It is not very important, but I noticed you made "heaprel" the last
parameter to all of the btree-related functions but the first parameter
to the gist functions. I thought it might be nice to make the order
consistent. I also was wondering why you made it the last argument to
all the btree functions to begin with (i.e. instead of directly after
the first rel argument).

> diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
> index 8af33d7b40..9bdac12baf 100644
> --- a/src/include/access/gist_private.h
> +++ b/src/include/access/gist_private.h
> @@ -440,7 +440,7 @@ extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
> FullTransactionId xid, Buffer parentBuffer,
> OffsetNumber downlinkOffset);
>
> -extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
> +extern void gistXLogPageReuse(Relation heaprel, Relation rel, BlockNumber blkno,
> FullTransactionId deleteXid);
>
> extern XLogRecPtr gistXLogUpdate(Buffer buffer,
> @@ -485,7 +485,7 @@ extern bool gistproperty(Oid index_oid, int attno,
> extern bool gistfitpage(IndexTuple *itvec, int len);
> extern bool gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size freespace);
> extern void gistcheckpage(Relation rel, Buffer buf);
> -extern Buffer gistNewBuffer(Relation r);
> +extern Buffer gistNewBuffer(Relation heaprel, Relation r);
> extern bool gistPageRecyclable(Page page);
> extern void gistfillbuffer(Page page, IndexTuple *itup, int len,
> OffsetNumber off);
> diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
> index 09f9b0f8c6..191f0e5808 100644
> --- a/src/include/access/gistxlog.h
> +++ b/src/include/access/gistxlog.h
> @@ -51,13 +51,13 @@ typedef struct gistxlogDelete
> {
> TransactionId snapshotConflictHorizon;
> uint16 ntodelete; /* number of deleted offsets */
> + bool isCatalogRel;

In some of these struct definitions, I think it would help comprehension
to have a comment explaining the purpose of this member.

>
> - /*
> - * In payload of blk 0 : todelete OffsetNumbers
> - */
> + /* TODELETE OFFSET NUMBERS */
> + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];

Thanks for all your hard work on this feature!

Best,
Melanie

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Justin Pryzby 2023-01-23 23:28:14 Re: Cygwin cleanup
Previous Message Peter Smith 2023-01-23 22:15:53 Re: Time delayed LR (WAS Re: logical replication restrictions)