Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Aaron Botsis <aaron(at)bt-r(dot)com>, pgsql-hackers(at)postgresql(dot)org
Cc: Francisco Olarte <folarte(at)peoplecall(dot)com>
Subject: Re: Patch: [BUGS] BUG #12320: json parsing with embedded double quotes
Date: 2015-01-07 19:45:51
Message-ID: 54AD8CEF.3080904@dunslane.net
Lists: pgsql-bugs pgsql-hackers

On 01/07/2015 08:25 AM, Aaron Botsis wrote:
> Hi folks, I was having a problem importing json data with COPY. Lots
> of things export data nicely as one json blob per line. This is
> excellent for directly importing into a JSON/JSONB column for analysis.
>
> ...Except when there’s an embedded double quote, or anything that’s
> escaped. COPY handles this, but by the time the escaped character
> hits the JSON parser, it is no longer escaped. This breaks the JSON
> parsing, which means I need to manipulate the input data to
> double-escape it. See bug #12320 for an example. Yuck.
>
> I propose this small patch that simply allows specifying COPY … ESCAPE
> without requiring the CSV parser. It will make it much easier for
> folks to use JSON-formatted export data directly going forward. This
> seemed like the simplest route.
>
> Usage is simply:
>
> postgres=# copy t1 from '/Users/nok/Desktop/queries.json';
> ERROR: invalid input syntax for type json
> DETAIL: Token "root" is invalid.
> CONTEXT: JSON data, line 1: ...1418066241619 AND <=1418671041621) AND
> user:"root...
> COPY t1, line 3, column bleh:
> "{"timestamp":"2014-12-15T19:17:32.505Z","duration":7.947,"query":{"query":{"filtered":{"filter":{"qu..."
> postgres=# copy t1 from '/Users/nok/Desktop/queries.json' escape '';
> COPY 1966
>
>

This isn't a bug. Neither CSV format nor TEXT format is particularly
suitable for json. I'm quite certain I could compose legal json that
will break your proposal (for example, with an embedded newline in the
whitespace).
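
To make that concrete: pretty-printed json is perfectly legal but spans
multiple lines, so a line-oriented COPY hands the parser a fragment. A
minimal sketch (the file name and exact error text are illustrative,
not from an actual run):

$ cat /tmp/pretty.json
{"user": "root",
 "duration": 7.947}

postgres=# copy t1 from '/tmp/pretty.json' escape '';
ERROR:  invalid input syntax for type json
DETAIL:  The input string ended unexpectedly.
CONTEXT:  JSON data, line 1: {"user": "root",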

It's also unnecessary. CSV format, while not designed for this, is
nevertheless sufficiently flexible to allow successful import of json
data meeting certain criteria (essentially no newlines), like this:

copy the_table(jsonfield)
from '/path/to/jsondata'
csv quote e'\x01' delimiter e'\x02';
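
The trick is that the quote and delimiter are set to control characters
(\x01 and \x02) that will essentially never occur in real data, so the
CSV machinery never splits or de-quotes anything, and each input line
reaches the json input function byte-for-byte intact. Applied to your
session above (a sketch; the table, column, and file names are taken
from your example):

postgres=# copy t1(bleh)
postgres-#   from '/Users/nok/Desktop/queries.json'
postgres-#   csv quote e'\x01' delimiter e'\x02';
COPY 1966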

You aren't the first person to encounter this problem. See
<http://adpgtech.blogspot.com/2014/09/importing-json-data.html>

Maybe we need to add something like this to the docs, or to the wiki.

Note too my comment in that blog post:

Now this solution is a bit of a hack. I wonder if there's a case for
a COPY mode that simply treats each line as a single datum. I also
wonder if we need some more specialized tools for importing JSON,
possibly one or more Foreign Data Wrappers. Such things could
handle, say, embedded newline punctuation.
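
In the meantime, the same \x01/\x02 trick can be dressed up with the
existing file_fdw contrib module, which accepts the COPY options. A
sketch, assuming 9.4's jsonb and the file from your session:

CREATE EXTENSION file_fdw;
CREATE SERVER json_files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE queries_json (doc jsonb) SERVER json_files
    OPTIONS (filename '/Users/nok/Desktop/queries.json',
             format 'csv', quote E'\x01', delimiter E'\x02');

SELECT doc->>'timestamp' FROM queries_json LIMIT 5;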

cheers

andrew
