Re: row filtering for logical replication

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "Rahila Syed" <rahilasyed90(at)gmail(dot)com>
Cc: Önder Kalacı <onderkalaci(at)gmail(dot)com>, japin <japinli(at)hotmail(dot)com>, "Michael Paquier" <michael(at)paquier(dot)xyz>, "David Steele" <david(at)pgmasters(dot)net>, "Craig Ringer" <craig(at)2ndquadrant(dot)com>, "Tomas Vondra" <tomas(dot)vondra(at)2ndquadrant(dot)com>, "Amit Langote" <amitlangote09(at)gmail(dot)com>, "PostgreSQL Hackers" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: row filtering for logical replication
Date: 2021-03-22 02:15:10
Message-ID: 7e4b5c48-a3d3-4b7e-b455-2d50c60a0ab7@www.fastmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Mar 18, 2021, at 7:51 AM, Rahila Syed wrote:
> 1.
> I think the docs are being incorrectly updated to add a column to pg_partitioned_table
> instead of pg_publication_rel.
Good catch.

> 2. +typedef struct PublicationRelationQual
> +{
> + Oid relid;
> + Relation relation;
> + Node *whereClause;
> +} PublicationRelationQual;
>
> Can this be given a more generic name like PublicationRelationInfo, so that the same struct
> can be used to store additional relation information in future, for ex. column names, if column filtering is introduced.
Good idea. I rename it and it'll be in this next patch set.

> 3. Also, in the above structure, it seems that we can do with storing just relid and derive relation information from it
> using table_open when needed. Am I missing something?
We need the Relation. See OpenTableList(). The way this code is organized, it
opens all publication tables and append each Relation to a list. This list is
used in PublicationAddTables() to update the catalog. I tried to minimize the
number of refactors while introducing this feature. We could probably revise
this code in the future (someone said in a previous discussion that it is weird
to open relations in one source code file -- publicationcmds.c -- and use it
into another one -- pg_publication.c).

> 4. Currently in logical replication, I noticed that an UPDATE is being applied on the subscriber even if the column values
> are unchanged. Can row-filtering feature be used to change it such that, when all the OLD.columns = NEW.columns, filter out
> the row from being sent to the subscriber. I understand this would need REPLICA IDENTITY FULL to work, but would be an
> improvement from the existing state.
This is how Postgres works.

postgres=# create table foo (a integer, b integer);
CREATE TABLE
postgres=# insert into foo values(1, 100);
INSERT 0 1
postgres=# select ctid, xmin, xmax, a, b from foo;
ctid | xmin | xmax | a | b
-------+--------+------+---+-----
(0,1) | 488920 | 0 | 1 | 100
(1 row)

postgres=# update foo set b = 101 where a = 1;
UPDATE 1
postgres=# select ctid, xmin, xmax, a, b from foo;
ctid | xmin | xmax | a | b
-------+--------+------+---+-----
(0,2) | 488921 | 0 | 1 | 101
(1 row)

postgres=# update foo set b = 101 where a = 1;
UPDATE 1
postgres=# select ctid, xmin, xmax, a, b from foo;
ctid | xmin | xmax | a | b
-------+--------+------+---+-----
(0,3) | 488922 | 0 | 1 | 101
(1 row)

You could probably abuse this feature and skip some UPDATEs when old tuple is
identical to new tuple. The question is: why would someone issue the same
command multiple times? A broken application? I would say: don't do it. Besides
that, this feature could impose an overhead into a code path that already
consume substantial CPU time. I've seen some tables with RIF and dozens of
columns that would certainly contribute to increase the replication lag.

> 5. Currently, any existing rows that were not replicated, when updated to match the publication quals
> using UPDATE tab SET pub_qual_column = 'not_filtered' where a = 1; won't be applied, as row
> does not exist on the subscriber. It would be good if ALTER SUBSCRIBER REFRESH PUBLICATION
> would help fetch such existing rows from publishers that match the qual now(either because the row changed
> or the qual changed)
I see. This should be addressed by a resynchronize feature. Such option is
useful when you have to change the row filter. It should certainly be implement
as an ALTER SUBSCRIPTION subcommand.

I attached a new patch set that addresses:

* fix documentation;
* rename PublicationRelationQual to PublicationRelationInfo;
* remove the memset that was leftover from a previous patch set;
* add new tests to improve coverage (INSERT/UPDATE/DELETE to exercise the row
filter code).

--
Euler Taveira
EDB https://www.enterprisedb.com/

Attachment Content-Type Size
v12-0001-Rename-a-WHERE-node.patch text/x-patch 1.7 KB
v12-0002-Row-filter-for-logical-replication.patch text/x-patch 62.3 KB
v12-0003-Print-publication-WHERE-condition-in-psql.patch text/x-patch 1.2 KB
v12-0004-Publication-WHERE-condition-support-for-pg_dump.patch text/x-patch 2.9 KB
v12-0005-Measure-row-filter-overhead.patch text/x-patch 1.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2021-03-22 02:22:07 Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.
Previous Message Euler Taveira 2021-03-22 01:58:14 Re: row filtering for logical replication