From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Önder Kalacı <onderkalaci(at)gmail(dot)com> |
Cc: | david(at)pgmasters(dot)net, Euler Taveira <euler(at)timbira(dot)com(dot)br>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Amit Langote <amitlangote09(at)gmail(dot)com>, movead li <movead(dot)li(at)highgo(dot)ca>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: row filtering for logical replication |
Date: | 2021-01-28 02:20:32 |
Message-ID: | 20210128022032.eq2qqc6zxkqn5syt@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2020-12-17 09:43:30 +0300, Önder Kalacı wrote:
> The above part can be considered the core of the logic, executed per tuple.
> As far as I can see, it has two downsides.
>
> First, calling `expression_planner()` for every tuple can be quite
> expensive. I created a sample table, loaded data and ran a quick benchmark
> to see its effect. I attached the very simple script that I used to
> reproduce the issue on my laptop. I'm pretty sure you can find nicer ways
> of doing similar perf tests, just sharing as a reference.
>
> The idea of the test is to add a WHERE clause to a table, but none of the
> tuples are filtered out. They just go through this code-path and send it to
> the remote node.
>
> #rows Patched | Master
> 1M 00:00:25.067536 | 00:00:16.633988
> 10M 00:04:50.770791 | 00:02:40.945358
>
>
> So, it seems a significant overhead to me. What do you think?
That seems almost prohibitively expensive. I think at the very least
some of this work would need to be done in a cached manner, e.g. via
get_rel_sync_entry().
> Secondly, probably more importantly, allowing any operator is as dangerous
> as allowing any function as users can create/overload operator(s).
That's not safe, indeed. It's not even just create/overloading
operators, as far as I can tell the expression can contain just plain
function calls.
The issue also isn't primarily that the user can overload functions,
it's that logical decoding is a limited environment, and not everything
is safe to do within. You e.g. only catalog tables can be
accessed. Therefore I don't think we can allow arbitrary expressions.
> The other problematic area was the performance, as calling
> `expression_planner()` for every tuple can be very expensive. To avoid
> that, it might be considered to ask users to provide a function instead of
> a free form WHERE clause, such that if the function returns true, the tuple
> is sent. The allowed functions need to be immutable SQL functions with bool
> return type. As we can parse the SQL functions, we should be able to allow
> only functions that rely on the above mentioned procs. We can apply as many
> restrictions (such as no modification query) as possible. For example, see
> below:
> ```
I don't think that would get us very far.
From a safety aspect: A function's body can be changed by the user at
any time, therefore we cannot rely on analyses of the function's body.
From a performance POV: SQL functions are planned at every invocation,
so that'd not buy us much either.
I think what you would have to do instead is to ensure that the
expression is "simple enough", and then process it into a cheaply
executable format in get_rel_sync_entry(). I'd suggest that in the first
version you just allow a simple ANDed list of 'foo.bar op constant'
expressions.
Does that make sense?
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Tang, Haiying | 2021-01-28 02:40:32 | RE: [POC] Fast COPY FROM command for the table with foreign partitions |
Previous Message | Peter Geoghegan | 2021-01-28 01:42:18 | Re: Thoughts on "killed tuples" index hint bits support on standby |