Re: row filtering for logical replication

From: Andres Freund <andres(at)anarazel(dot)de>
To: Önder Kalacı <onderkalaci(at)gmail(dot)com>
Cc: david(at)pgmasters(dot)net, Euler Taveira <euler(at)timbira(dot)com(dot)br>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Amit Langote <amitlangote09(at)gmail(dot)com>, movead li <movead(dot)li(at)highgo(dot)ca>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: row filtering for logical replication
Date: 2021-01-28 02:20:32
Message-ID: 20210128022032.eq2qqc6zxkqn5syt@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2020-12-17 09:43:30 +0300, Önder Kalacı wrote:
> The above part can be considered the core of the logic, executed per tuple.
> As far as I can see, it has two downsides.
>
> First, calling `expression_planner()` for every tuple can be quite
> expensive. I created a sample table, loaded data and ran a quick benchmark
> to see its effect. I attached the very simple script that I used to
> reproduce the issue on my laptop. I'm pretty sure you can find nicer ways
> of doing similar perf tests, just sharing as a reference.
>
> The idea of the test is to add a WHERE clause to a table, but none of the
> tuples are filtered out. They just go through this code-path and send it to
> the remote node.
>
> #rows Patched | Master
> 1M 00:00:25.067536 | 00:00:16.633988
> 10M 00:04:50.770791 | 00:02:40.945358
>
>
> So, it seems a significant overhead to me. What do you think?

That seems almost prohibitively expensive. I think at the very least
some of this work would need to be done in a cached manner, e.g. via
get_rel_sync_entry().

> Secondly, probably more importantly, allowing any operator is as dangerous
> as allowing any function as users can create/overload operator(s).

That's not safe, indeed. It's not even just create/overloading
operators, as far as I can tell the expression can contain just plain
function calls.

The issue also isn't primarily that the user can overload functions,
it's that logical decoding is a limited environment, and not everything
is safe to do within. You e.g. only catalog tables can be
accessed. Therefore I don't think we can allow arbitrary expressions.

> The other problematic area was the performance, as calling
> `expression_planner()` for every tuple can be very expensive. To avoid
> that, it might be considered to ask users to provide a function instead of
> a free form WHERE clause, such that if the function returns true, the tuple
> is sent. The allowed functions need to be immutable SQL functions with bool
> return type. As we can parse the SQL functions, we should be able to allow
> only functions that rely on the above mentioned procs. We can apply as many
> restrictions (such as no modification query) as possible. For example, see
> below:
> ```

I don't think that would get us very far.

From a safety aspect: A function's body can be changed by the user at
any time, therefore we cannot rely on analyses of the function's body.

From a performance POV: SQL functions are planned at every invocation,
so that'd not buy us much either.

I think what you would have to do instead is to ensure that the
expression is "simple enough", and then process it into a cheaply
executable format in get_rel_sync_entry(). I'd suggest that in the first
version you just allow a simple ANDed list of 'foo.bar op constant'
expressions.

Does that make sense?

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tang, Haiying 2021-01-28 02:40:32 RE: [POC] Fast COPY FROM command for the table with foreign partitions
Previous Message Peter Geoghegan 2021-01-28 01:42:18 Re: Thoughts on "killed tuples" index hint bits support on standby