Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Terry Laurenzo <tj(at)laurenzo(dot)org>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date: 2010-10-18 16:15:07
Message-ID: AANLkTi=uaD_M5vo491ohO6Gcmgkqi+02vVRYzF-aM+_0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> > I like as simple design as we can accept. ISTM format, I/O interface,
> > simple get/set, mapping tuple from/to object, and indexing are minimum
> > requirement.
>
> +1 to small start, but simple get/set are already debatable...
> For example, text/json conversion:
> A. SELECT '<json>'::json;
> B. SELECT '<text>'::text::json;
>
> In the git repo, A calls parse_json_to_bson_as_vardata(), so the input
> should be a json format. OTOH, B calls pgjson_json_from_text(), so the
> input can be any text. Those behaviors are surprising. I think we have
> no other choice but to define text-to-json cast as parsing. The same
> can be said for json-to-text -- type-output function vs. extracting
> text value from json.
>
> I think casting text to/from json should behave in the same way as type
> input/output. The xml type works in the same manner. And if so, we might
> not have any casts to/from json for consistency, even though there are
> no problems in casts for non-text types.
>

I just reworked some of this last night, so I'm not sure which version you
are referring to (new version has a pgplugin/jsoncast.c source file). I was
basically circling around the same thing as you trying to find something
that felt natural and not confusing. I agree that we don't have much of a
choice to keep in/out functions as parse/serialize and that then introducing
casts that do differently creates confusion. When I was playing with it, I
was getting confused, and I wrote it. :)

An alternative to pg casting to extract postgres values could be to
introduce analogs to JavaScript constructors, which is the JavaScript way to
cast. For example: String(json), Number(json), Date(json). This would feel
natural to a JavaScript programmer and would be explicit and non-surprising:
A. SELECT String('{a: 1, b:2}'::json -> 'a') (Returns postgres text)
B. SELECT Number('1'::json) (Returns postgres decimal)

I think that the most common use case for this type of thing in the DB will
be to extract a JSON scalar as a postgres scalar.

The inverse, while probably less useful, is currently represented by the
json_from_* functions. We could collapse all of these down to one
overloaded function, say ToJson(...):
A. SELECT ToJson(1) (Would return a json type with an int32 "1" value)
B. SELECT ToJson('Some String') (Would return a json type with a string
value)

There might be some syntactic magic we could do by adding an intermediate
jsonscalar type, but based on trying real cases with this stuff, you always
end up having to be explicit about your conversions anyway. Having implicit
type coercion from this polymorphic type tends to make things confusing,
imo.

>
> I'll list issues before we start json types even in the simplest cases:
> ----
> 1. where to implement json core: external library vs. inner postgres
> 2. internal format: text vs. binary (*)
> 3. encoding: always UTF-8 vs. database encoding (*)
> 4. meaning of casts text to/from json: parse/stringify vs. get/set
> 5. parser implementation: flex/bison vs. hand-coded.
> ----
> (*) Note that we would have comparison two json values in the future. So,
> we might need to normalize the internal format even in text representation.
>
> The most interesting parts of json types, including indexing and jsonpath,
> would be made on the json core. We need conclusions about those issues.
>
>
My opinions or ramblings on the above:
1. There's a fair bit of code involved for something that many are going
to gloss over. I can think of pros/cons for external/internal/contrib and
I'm not sure which I would choose.
2. I'm definitely in the binary camp, but part of the reason for building
it out was to try it with some real world cases to get a feel for
performance implications end to end. We make heavy use of MongoDB at the
office and I was thinking it might make sense to strip some of those cases
down and see how they would be implemented in this context. I'll write up
more thoughts on how I think text/binary should perform for various cases
tonight.
3. I think if we go with binary, we should always store UTF-8 in the
binary structure. Otherwise, we just have too much of the guts of the
binary left to the whim of the database encoding. As currently implemented,
all strings generated by the in/out functions should be escaped so that they
are pure ascii (not quite working, but there in theory). JSON is by
definition UTF-8, and in this case, I think it trumps database encoding.
4. My thoughts on the casts are above.
5. There seems to be a lot of runtime and code size overhead inherent in
the flex/bison parsers, especially considering that they will most
frequently be invoked for very small streams. Writing a good hand-coded
parser for comparison is just a matter of which bottle of wine to choose
prior to spending the hours coding it, and I would probably defer the
decision until later.

Terry

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2010-10-18 17:07:08 Re: string function - "format" function proposal
Previous Message Greg Stark 2010-10-18 16:06:30 Re: Timeout and wait-forever in sync rep