Re: COPY FROM WHEN condition

From: David Fetter <david(at)fetter(dot)org>
To: Corey Huinker <corey(dot)huinker(at)gmail(dot)com>
Cc: nasbyj(at)amazon(dot)com, surafel3000(at)gmail(dot)com, cmt(at)burggraben(dot)net, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: COPY FROM WHEN condition
Date: 2018-11-02 14:21:32
Message-ID: 20181102142132.GJ12677@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Nov 01, 2018 at 10:57:25PM -0400, Corey Huinker wrote:
> >
> > > Are you thinking something like having a COPY command that provides
> > > results in such a way that they could be referenced in a FROM clause
> > > (perhaps a COPY that defines a cursor…)?
> >
> > That would also be nice, but what I was thinking of was that some
> > highly restricted subset of cases of SQL in general could lend
> > themselves to levels of optimization that would be impractical in
> > other contexts.
>
> If COPY (or a syntactical equivalent) can return a result set, then the
> whole of SQL is available to filter and aggregate the results and we don't
> have to invent new syntax, or endure confusion whenCOPY-WHEN syntax behaves
> subtly different from a similar FROM-WHERE.

That's an excellent point.

> Also, what would we be saving computationally? The whole file (or program
> output) has to be consumed no matter what, the columns have to be parsed no
> matter what. At least some of the columns have to be converted to their
> assigned datatypes enough to know whether or not to filter the row, but we
> might be able push that logic inside a copy. I'm thinking of something like
> this:
>
> SELECT x.a, sum(x.b)
> FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )

Apologies for bike-shedding, but wouldn't the following be a better
fit with the current COPY?

COPY t(a integer, b numeric, c text, d date, e json) FROM '/path/to/foo.txt' WITH (FORMAT CSV, INLINE)

> WHERE x.d >= '2018-11-01'
>
>
> In this case, there is the *opportunity* to see the following optimizations:
> - columns c and e are never referenced, and need never be turned into a
> datum (though we might do so just to confirm that they conform to the data
> type)

That sounds like something that could go inside the WITH extension
I'm proposing above.

[STRICT_TYPE boolean DEFAULT true]?

This might not be something that has to be in version 1.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2018-11-02 14:34:04 Re: Hash Joins vs. Bloom Filters / take 2
Previous Message Tom Lane 2018-11-02 14:07:10 Re: WIP: Avoid creation of the free space map for small tables