Re: COPY FROM WHEN condition

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Surafel Temesgen <surafel3000(at)gmail(dot)com>
Cc: Adam Berlin <berlin(dot)ab(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: COPY FROM WHEN condition
Date: 2018-11-28 23:17:31
Message-ID: c1f302f0-d806-6c64-fe6c-3febe03a4eb1@2ndquadrant.com
Lists: pgsql-hackers

On 11/26/18 2:25 PM, Surafel Temesgen wrote:
>
>
> On Sat, Nov 24, 2018 at 5:09 AM Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>
> 1) While comparing this to the FILTER clause we already have for
> aggregates, I've noticed the aggregate version is
>
>     FILTER '(' WHERE a_expr ')'
>
> while here we have
>
>     FILTER '(' a_expr ')'
>
> For a while I was thinking that maybe we should use the same syntax
> here, but I don't think so. The WHERE bit seems rather unnecessary and
> we probably implemented it only because it's required by SQL standard,
> which does not apply to COPY. So I think this is fine.
>
>  
> ok
>
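
For reference, here are the two forms side by side, the COPY variant being
the one the patch proposes (assuming t has an integer column x, purely for
illustration):

    -- aggregate version (WHERE keyword required by the SQL standard)
    select count(*) filter (where x > 0) from t;

    -- COPY version per the patch (no WHERE keyword)
    copy t from '/home/user/t.data' filter (x > 0);
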
>
> 2) The various parser checks emit errors like this:
>
>     case EXPR_KIND_COPY_FILTER:
>         err = _("cannot use subquery in copy from FILTER condition");
>         break;
>
> I think the "copy from" should be capitalized, to make it clear that it
> refers to a COPY FROM command and not a copy of something.
>
>
> 3) I think there should be regression tests for these prohibited things,
> i.e. for a set-returning function, for a non-existent column, etc.
>
>
> fixed
>
>
> 4) What might be somewhat confusing for users is that the filter uses a
> single snapshot to evaluate the conditions for all rows. That is, one
> might do this
>
>     create or replace function f() returns int as $$
>         select count(*)::int from t;
>     $$ language sql;
>
> and hope that
>
>     copy t from '/...' filter (f() <= 100);
>
> only ever imports the first 100 rows - but that's not true, of course,
> because f() uses the snapshot acquired at the very beginning. For
> example INSERT SELECT does behave differently:
>
>     test=# copy t from '/home/user/t.data' filter (f() < 100);
>     COPY 81
>     test=# insert into t select * from t where f() < 100;
>     INSERT 0 19
>
> Obviously, this is not an issue when the filter clause references only
> the input row (which I assume will be the primary use case).
>
> Not sure if this is expected / appropriate behavior, and if the patch
> needs to do something else here.
>
> Comparing with the overhead of setting a snapshot before evaluating every
> row, and considering that this kind of usage is not frequent, it seems to
> me the behavior is acceptable.
>

I'm not really buying the argument that this behavior is acceptable
simply because using the feature like this will be uncommon. That seems
like a rather weak reason to accept it.

That said, I agree we don't want to make COPY less efficient, at least not
unless absolutely necessary. But I think we can handle this simply by
restricting what's allowed to appear in the FILTER clause.

It should be fine to allow IMMUTABLE and STABLE functions, but not
VOLATILE ones. That should fix the example I shared, because f() would
not be allowed.
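
For instance, f() from the example has no explicit volatility marking, so it
defaults to VOLATILE, which is easy to verify:

    select provolatile from pg_proc where proname = 'f';
    -- 'v', i.e. VOLATILE, so filter (f() <= 100) would be rejected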

We could mark it as STABLE explicitly, which would make it usable in the
FILTER clause, and since STABLE says "same result for all calls in the
statement", the single-snapshot behavior would then be expected and
essentially legal (in a way it'd be mislabeling, but we trust such labels in
other places too).
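
That is, something like this (the function from the example, with just the
STABLE keyword added):

    create or replace function f() returns int as $$
        select count(*)::int from t;
    $$ language sql stable;

    -- now allowed; with a single snapshot f() gives the same result for
    -- every row, which is what the STABLE marking promises
    copy t from '/home/user/t.data' filter (f() <= 100);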

So I think there are three options:

(a) accept the current behavior (single snapshot, same result for all
function calls)

(b) prohibit VOLATILE functions in the FILTER clause altogether

(c) allow VOLATILE functions in the FILTER clause, but change the
implementation to make the behavior sane

I definitely vote -1 on (a) unless someone presents a much better
argument for allowing it than "it's uncommon".

Which leaves us with (b) and (c). Clearly, (b) is simpler to implement,
because (c) needs to do the same detection anyway and then some additional
work on top of it. I'm not sure how much more complex (c) is, compared to (b).

After eyeballing the copy.c code for a bit, I think it would need to use
CIM_SINGLE when there are volatile functions, similarly to volatile
default values (see volatile_defexprs), and then increment the command
ID after each insert. Of course, we don't want to do this by default,
but only when actually needed (with VOLATILE functions), because the
multi-inserts are quite a bit more efficient.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
