Re: row filtering for logical replication

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Önder Kalacı <onderkalaci(at)gmail(dot)com>, japin <japinli(at)hotmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: row filtering for logical replication
Date: 2021-09-24 22:00:32
Message-ID: 9d5073b7-e461-d0f3-89b7-0ddb09936f73@enterprisedb.com

On 9/24/21 7:20 AM, Amit Kapila wrote:
> On Thu, Sep 23, 2021 at 6:03 PM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>>
>> 6) parse_oper.c
>>
>> I'm having some second thoughts about (not) allowing UDFs ...
>>
>> Yes, I get that if the function starts failing, e.g. because querying a
>> dropped table or something, that breaks the replication and can't be
>> fixed without a resync.
>>
>
> The other problem is that users can access/query any table inside the
> function, and that also won't work in the logical decoding
> environment, because we use historic snapshots, which can only access
> catalog tables.
>

True. I always forget about some of these annoying issues. Let's
document all of this in some comment / README. I see we still don't have

src/backend/replication/logical/README

which is a bit surprising, considering how complex this code is.
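
To make the failure mode concrete, here is a hypothetical sketch (the
table, function and publication names are made up), assuming UDFs were
allowed in the row filter:

  -- nothing looks wrong at DDL time ...
  CREATE FUNCTION is_active_customer(int) RETURNS bool AS
    $$ SELECT EXISTS (SELECT 1 FROM customers
                      WHERE id = $1 AND active) $$
    LANGUAGE sql;

  CREATE PUBLICATION pub_orders FOR TABLE orders
    WHERE (is_active_customer(customer_id));

  -- ... but the filter is evaluated during decoding under a historic
  -- snapshot, which can only see catalog tables, so the SELECT from
  -- "customers" cannot work. And if "customers" is later dropped, the
  -- walsender keeps failing and the subscription needs a resync.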

>> That's pretty annoying, but disallowing anything user-defined
>> (functions and operators) seems perhaps overly cautious? Also,
>> extensibility is one of the hallmarks of Postgres, and disallowing
>> all custom UDFs and operators seems to contradict that ...
>>
>> Perhaps just explaining what the expression can / can't do in the
>> docs, with clear warnings about the risks, would be acceptable.
>>
>
> I think the right way to support functions is by explicitly marking
> them, and in one of the emails above Jeff Davis also agreed with
> that. I think we should probably introduce a new marking for this. I
> feel this is important because, without it, it won't be safe to call
> even some of the built-in functions that can access/update the
> database (non-immutable functions), due to the logical decoding
> environment restrictions.
>

I agree that seems reasonable. Is there any reason not to simply use
IMMUTABLE for this purpose? Seems like a good match to me.

Yes, the user can lie and label a function IMMUTABLE when it really
isn't, but that's on them. Yes, it's harder to fix than e.g. a broken
expression index.
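
For example, nothing stops someone from doing this (made-up example):

  -- the function reads a table, so it's clearly not immutable, but
  -- the label is taken at face value, so an IMMUTABLE-only check on
  -- the row filter would let it through
  CREATE FUNCTION current_discount(int) RETURNS numeric AS
    $$ SELECT discount FROM discounts WHERE customer_id = $1 $$
    LANGUAGE sql IMMUTABLE;

If that function is then used in a row filter, it breaks at decoding
time just like any other non-immutable function would, except the fix
is a resync rather than a REINDEX.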

>>
>> 12) misuse of REPLICA IDENTITY
>>
>> The more I think about this, the more I think we're actually misusing
>> REPLICA IDENTITY for something entirely different. The whole purpose of
>> RI was to provide a row identifier for the subscriber.
>>
>> But now we're using it to ensure we have all the necessary columns,
>> which is entirely orthogonal to the original purpose. I predict this
>> will have rather negative consequences.
>>
>> People will either switch everything to REPLICA IDENTITY FULL, or create
>> bogus unique indexes with extra columns. Which is really silly, because
>> it wastes network bandwidth (transfers more data) or local resources
>> (CPU and disk space to maintain extra indexes).
>>
>> IMHO this needs more infrastructure to request extra columns to decode
>> (e.g. for the filter expression), and then remove them before sending
>> the data to the subscriber.
>>
>
> Yeah, but that would add load to write operations, and I am not sure
> at this stage, but maybe there are other ways to extend the current
> infrastructure where we build the snapshots, so that we can access
> user tables instead of only catalog tables. Such enhancements, if
> feasible, would be useful not only for allowing additional column
> access in row filters but also for other purposes, like allowing
> access to functions that access user tables. I feel we can extend
> this later, depending on usage and requests. For the first version,
> this doesn't sound too limiting to me.
>

I'm not really buying the argument that this means overhead for write
operations. Well, it does, but the current RI approach forces users to
either use REPLICA IDENTITY FULL or add an index covering the filter
attributes. Neither of those options is free, and I'd bet the extra
overhead of logging just the row filter columns would actually be
lower.
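
To be concrete, to filter on a column that's not part of the replica
identity, users would end up doing one of these (made-up names again):

  -- either log and ship all the old columns for every UPDATE/DELETE ...
  ALTER TABLE orders REPLICA IDENTITY FULL;

  -- ... or maintain an otherwise useless unique index, just to get the
  -- filter column ("region" here) into the replica identity
  -- (assuming both columns are NOT NULL)
  CREATE UNIQUE INDEX orders_id_region_idx ON orders (id, region);
  ALTER TABLE orders REPLICA IDENTITY USING INDEX orders_id_region_idx;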

If the argument is merely to limit the scope of this patch, fine. But
I'd bet the amount of code we'd have to add to ExtractReplicaIdentity
(or somewhere close to it) would be fairly small. We'd need to cache
which columns are needed (similar to RelationGetIndexAttrBitmap), which
might be a bit more complex because it has to consider all the
publications, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
