Re: COPY FROM WHEN condition

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: COPY FROM WHEN condition
Date: 2018-11-02 13:17:46
Message-ID: 0623219e-c034-bc10-c546-a01da35e6349@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/02/2018 03:57 AM, Corey Huinker wrote:
> > Are you thinking something like having a COPY command that provides
> > results in such a way that they could be referenced in a FROM clause
> > (perhaps a COPY that defines a cursor…)?
>
> That would also be nice, but what I was thinking of was that some
> highly restricted subset of cases of SQL in general could lend
> themselves to levels of optimization that would be impractical in
> other contexts.
>
>
> If COPY (or a syntactical equivalent) can return a result set, then the
> whole of SQL is available to filter and aggregate the results and we
> don't have to invent new syntax, or endure confusion whenCOPY-WHEN
> syntax behaves subtly different from a similar FROM-WHERE.
>
> Also, what would we be saving computationally? The whole file (or
> program output) has to be consumed no matter what, the columns have to
> be parsed no matter what. At least some of the columns have to be
> converted to their assigned datatypes enough to know whether or not to
> filter the row, but we might be able push that logic inside a copy. I'm
> thinking of something like this:
>
> SELECT x.a, sum(x.b)
> FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer,
> b numeric, c text, d date, e json) )
> WHERE x.d >= '2018-11-01'
>
>
> In this case, there is the /opportunity/ to see the following optimizations:
> - columns c and e are never referenced, and need never be turned into a
> datum (though we might do so just to confirm that they conform to the
> data type)
> - if column d is converted first, we can filter on it and avoid
> converting columns a,b
> - whatever optimizations we can infer from knowing that the two
> surviving columns will go directly into an aggregate
>
> If we go this route, we can train the planner to notice other
> optimizations and add those mechanisms at that time, and then existing
> code gets faster.
>
> If we go the COPY-WHEN route, then we have to make up new syntax for
> every possible future optimization.

IMHO those two things address vastly different use-cases. The COPY WHEN
case deals with filtering data while importing them into a database,
while what you're describing seems to be more about querying data stored
in a CSV file. But we already do have a solution for that - FDW, and I'd
say it's working pretty well. And AFAIK it does give you tools to
implement most of what you're asking for. I don't see why should we bolt
this on top of COPY, or how is it an alternative to COPY WHEN.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2018-11-02 13:28:17 Re: WIP Patch: Add a function that returns binary JSONB as a bytea
Previous Message Tomas Vondra 2018-11-02 13:11:45 Re: zheap: a new storage format for PostgreSQL