Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'

From: Darafei "Komяpa" Praliaskouski <me(at)komzpa(dot)net>
To: Joel Jacobson <joel(at)compiler(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David G(dot) Johnston" <david(dot)g(dot)johnston(at)gmail(dot)com>, Isaac Morland <isaac(dot)morland(at)gmail(dot)com>, Chapman Flack <chap(at)anastigmatix(dot)net>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'
Date: 2021-05-06 06:21:26
Message-ID: CAC8Q8tKm1vhY2JVCrce+Pg5+P+1VMdbGFV0zfRCNUZJAWSR_vQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I have similar problems and what is really needed is a way to get a file
from client side into a server side object that can be dealt with later.
The most popular way is COPY and it is built into the psql tool. In general
it supports \copy wrapper, and there is COPY FROM STDIN. However, it is not
available to the files that are not following the csv-like structure. I had
to use it for XML and huge JSON files before, and it's always `sed` before
the import and a replace() after.

pg_read_file does not help on cloud and managed installs of postgres here.

What I would prefer is some new COPY mode like RAW that will just push
whatever it gets on the stdin/input into the cell on the server side. This
way it can be proxied by psql, utilize existing infra for passing streams
and be used in shell scripting.

On Thu, May 6, 2021 at 9:14 AM Joel Jacobson <joel(at)compiler(dot)org> wrote:

> On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
>
> "Joel Jacobson" <joel(at)compiler(dot)org> writes:
> > I think you misunderstood the problem.
> > I don't want the entire file to be considered a single value.
> > I want each line to become its own row, just a row with a single column.
>
> > So I actually think COPY seems like a perfect match for the job,
> > since it does precisely that, except there is no delimiter in this case.
>
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
> altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.
>
>
> Thanks for providing strong arguments why the COPY approach is a dead-end,
> I agree.
>
> However, as demonstrated in my previous email, using
>
> string_to_table(pg_read_file( filename ), E'\n')
>
> has its performance as well as max size issues.
>
> Maybe these two problems could be solved by combining the two functions
> into one?
>
> file_to_table ( filename text, delimiter text [, null_string text ] ) →
> setof text
>
> I'm thinking thanks to returning "setof text", such a function could read
> a stream,
> and return a line as soon as a delimiter is encountered, not having to keep
> the entire file in memory at any time.
>
> /Joel
>

--
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Etsuro Fujita 2021-05-06 06:25:25 Re: Asynchronous Append on postgres_fdw nodes.
Previous Message Joel Jacobson 2021-05-06 06:13:48 Re: COPY table_name (single_column) FROM 'unknown.txt' DELIMITER E'\n'