Re: making the backend's json parser work in frontend code

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Mark Dilger <mark(dot)dilger(at)enterprisedb(dot)com>, David Steele <david(at)pgmasters(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: making the backend's json parser work in frontend code
Date: 2020-01-23 18:02:39
Message-ID: CA+TgmoZm+d7YzVzLiskL5PQTPT5X-KfO8SmOonYb=5uV-fQ9CQ@mail.gmail.com

On Thu, Jan 23, 2020 at 12:24 PM Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> I do have files with Latin-1-encoded names in my filesystem, even though
> my system is UTF-8, so I understand the problem. I was wondering if it
> would work to encode any non-UTF8-valid name using something like
> base64; the encoded name will be plain ASCII and can be put in the
> manifest, probably using a different field of the JSON object -- so for
> a normal file you'd have { path => '1234/2345' } but for a
> Latin-1-encoded file you'd have { path_base64 => '4Wx2YXJvLmNvbmYK' }.
> Then it's the job of the tool to ensure it decodes the name to its
> original form when creating/querying for the file.

Right. That's what I meant, a couple of messages back, when I
mentioned an extra layer of escaping, but your explanation here is
better because it's more detailed.
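
To make that concrete, here is a minimal sketch (in Python, purely for
illustration; the key names follow your example and the helper names
are invented, not a proposal for the actual manifest code) of what the
generating and reading sides of such a tool might do:

    import base64
    import json

    def manifest_entry(raw_path: bytes) -> dict:
        # A path that is valid UTF-8 goes into "path" verbatim; anything
        # else is base64-encoded into "path_base64" so that the manifest
        # itself stays plain ASCII.
        try:
            return {"path": raw_path.decode("utf-8")}
        except UnicodeDecodeError:
            return {"path_base64":
                    base64.b64encode(raw_path).decode("ascii")}

    def entry_path(entry: dict) -> bytes:
        # The reading tool has to check for both keys to recover the
        # original file name.
        if "path_base64" in entry:
            return base64.b64decode(entry["path_base64"])
        return entry["path"].encode("utf-8")

    # A normal name, and a Latin-1 name (0xE1 is a-acute in Latin-1,
    # which is not valid UTF-8):
    print(json.dumps(manifest_entry(b"1234/2345")))
    print(json.dumps(manifest_entry(b"\xe1lvaro.conf")))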

> A problem I have with this idea is that this is very corner-casey, so
> most tool implementors will never realize that there's a need to decode
> certain file names.

That's a valid concern. I would not necessarily have thought that
out-of-core tools would find much use for reading the manifests,
provided PostgreSQL itself knows both how to generate them and how to
validate them, but the interest in this topic suggests that people do
care about that.

Mostly, I think this issue shows the folly of imagining that putting
everything into JSON is a good idea because it gets rid of escaping
problems. Actually, what it does is create multiple kinds of escaping
problems. With the format I proposed, you only have to worry that the
file name might contain a tab character, because in that format, tab
is the delimiter. But, if we use JSON, then we've got the same problem
with JSON's delimiter, namely a double quote, which the JSON parser
will solve for you. We then have this additional and somewhat obscure
problem with invalidly encoded data, to which JSON itself provides no
solution. We have to invent our own, probably along the lines of what
you have just proposed. I think one can reasonably wonder whether this
is really an improvement over just inventing a way to escape tabs.
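
To illustrate the asymmetry: the JSON encoder in pretty much any
language will handle the delimiter escaping for you automatically, but
will flatly refuse bytes that are not valid UTF-8. A quick Python
sketch, not tied to any particular manifest format:

    import json

    # JSON escaping handles delimiter characters automatically ...
    print(json.dumps({"path": "odd\tname\"with quote"}))

    # ... but bytes that are not valid UTF-8 have no JSON
    # representation at all, so the format has to invent its own
    # convention for them.
    try:
        json.dumps({"path": b"\xe1lvaro.conf".decode("utf-8")})
    except UnicodeDecodeError as err:
        print("not representable as JSON text:", err)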

That said, there are other reasons to want to go with JSON, most of
all the fact that it's easy to see how to extend the format to
additional fields. Once you decide that each file is represented by an
object, you can add any keys you like to that object, and things should
scale up nicely. It has been my contention that we probably will not
find the need to add much more here, but such arguments are always
suspect and have a good chance of being wrong. Also, such prophecies
can be self-fulfilling: if the format is easy to extend, then people
may extend it, whereas if it is hard to extend, they may not try, or
they may try and then give up.
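
To illustrate the point (with made-up field names, not a claim about
what the manifest will actually contain):

    import json

    # A hypothetical per-file object as it might look today:
    entry = {"path": "1234/2345", "size": 8192}

    # Because each file is an object, a later version can add keys
    # without breaking existing readers, which simply ignore fields
    # they do not recognize:
    entry["last-modified"] = "2020-01-23 18:02:39 GMT"
    print(json.dumps(entry))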

At the end of the day, I'm willing to make this work either way. I do
not think that my original proposal was bad, but there were things not
to like about it. There are also things not to like about using a
JSON-based format, and this seems to me to be a fairly significant
one. However, both sets of problems are solvable, and neither design
is awful. It's just a question of which kind of warts we like better.
To be blunt, I've already spent a lot more effort on this problem than
I would have liked, and more than 90% of it has been spent on the
issue of how to format a file that only PostgreSQL needs to read and
write. While I do not think that good file formats are unimportant, I
remain unconvinced that switching to JSON is making things better. It
seems like it's just making them different, while inflating the amount
of coding required by a fairly significant multiple.

That being said, unless somebody objects in the next few days, I am
going to assume that the people who preferred JSON over a
tab-separated file are also happy with the idea of using base64
encoding, as you proposed above, to represent files whose names are
not valid UTF-8 sequences. I will then rewrite the patch that
generates the manifests to use that format, rewrite the validator
patch to parse that format using this infrastructure, and hopefully
end up with something that can be reviewed and committed before we run
out of time for this release. If anybody wants to vote for another
plan, please vote soon.

In the meantime, any review of the new patches I posted here yesterday
would be warmly appreciated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
