Re: row filtering for logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Ajin Cherian <itsajin(at)gmail(dot)com>, Euler Taveira <euler(at)eulerto(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Önder Kalacı <onderkalaci(at)gmail(dot)com>, japin <japinli(at)hotmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, David Steele <david(at)pgmasters(dot)net>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Amit Langote <amitlangote09(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: row filtering for logical replication
Date: 2021-09-25 04:23:42
Message-ID: CAA4eK1LTCaGy2rnFR148SGmwqJw65uXf-j9hvkE7z6uuH-xBKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, Sep 25, 2021 at 3:30 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 9/24/21 7:20 AM, Amit Kapila wrote:
> >
> > I think the right way to support functions is by the explicit marking
> > of functions and in one of the emails above Jeff Davis also agreed
> > with the same. I think we should probably introduce a new marking for
> > this. I feel this is important because without this it won't be safe
> > to access even some of the built-in functions that can access/update
> > database (non-immutable functions) due to logical decoding environment
> > restrictions.
> >
>
> I agree that seems reasonable. Is there any reason why not to just use
> IMMUTABLE for this purpose? Seems like a good match to me.
>

It will just solve one part of the puzzle (related to database access)
but it is better to avoid the risk of broken replication by explicit
marking especially for UDFs or other user-defined objects. You seem to
be okay documenting such risk but I am not sure we have an agreement
on that especially because that was one of the key points of
discussions in this thread and various people told that we need to do
something about it. I personally feel we should do something if we
want to allow user-defined functions or operators because as reported
in the thread this problem has been reported multiple times. I think
we can go ahead with IMMUTABLE built-ins for the first version and
then allow UDFs later or let's try to find a way for explicit marking.

> Yes, the user can lie and label something that is not really IMMUTABLE,
> but that's his fault. Yes, it's harder to fix than e.g. for indexes.
>

Agreed and I think we can't do anything about this.

> >>
> >> 12) misuse of REPLICA IDENTITY
> >>
> >> The more I think about this, the more I think we're actually misusing
> >> REPLICA IDENTITY for something entirely different. The whole purpose of
> >> RI was to provide a row identifier for the subscriber.
> >>
> >> But now we're using it to ensure we have all the necessary columns,
> >> which is entirely orthogonal to the original purpose. I predict this
> >> will have rather negative consequences.
> >>
> >> People will either switch everything to REPLICA IDENTITY FULL, or create
> >> bogus unique indexes with extra columns. Which is really silly, because
> >> it wastes network bandwidth (transfers more data) or local resources
> >> (CPU and disk space to maintain extra indexes).
> >>
> >> IMHO this needs more infrastructure to request extra columns to decode
> >> (e.g. for the filter expression), and then remove them before sending
> >> the data to the subscriber.
> >>
> >
> > Yeah, but that would have an additional load on write operations and I
> > am not sure at this stage but maybe there could be other ways to
> > extend the current infrastructure wherein we build the snapshots using
> > which we can access the user tables instead of only catalog tables.
> > Such enhancements if feasible would be useful not only for allowing
> > additional column access in row filters but for other purposes like
> > allowing access to functions that access user tables. I feel we can
> > extend this later as well seeing the usage and requests. For the first
> > version, this doesn't sound too limiting to me.
> >
>
> I'm not really buying the argument that this means overhead for write
> operations. Well, it does, but the current RI approach is forcing users
> to either use RIF or add an index covering the filter attributes.
> Neither of those options is free, and I'd bet the extra overhead of
> adding just the row filter columns would be actually lower.
>
> If the argument is merely to limit the scope of this patch, fine.
>

Yeah, that is one and I am not sure that adding extra WAL is the best
or only solution for this problem. As mentioned in my previous
response, I think we eventually need to find a way to access user
tables to support UDFs (that access database) or sub-query which other
databases already support, and for that, we might need to enhance the
current snapshot mechanism after which we might not need any
additional WAL even for additional columns in row filter. I don't
think anyone of us has evaluated in detail the different ways this
problem can be solved and the pros/cons of each approach, so limiting
the scope for this purpose doesn't seem like a bad idea to me.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2021-09-25 05:21:51 Re: Fixing WAL instability in various TAP tests
Previous Message Peter Geoghegan 2021-09-25 03:08:19 Re: decoupling table and index vacuum