Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Terry Laurenzo <tj(at)laurenzo(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date: 2010-10-16 16:59:34
Message-ID: AANLkTinxBrLLx_bjNB0pT4kk3Qnt-Zwb_aU8ibz7m8Ea@mail.gmail.com
Lists: pgsql-hackers

Hi all -
I independently started some work on a similar capability to the one
contributed back in August by Joey Adams for a json datatype. Before
starting, I did a quick search but for some reason didn't turn up this
existing thread.

What I've been working on is out on github for now:
http://github.com/tlaurenzo/pgjson

When I started, I was actually aiming for something else and got caught
up going down this rabbit hole. I took a different design approach,
making the internal form an extended BSON stream and implementing
event-driven parsing and serialization to and from the different
formats. There was some discussion in the original thread about storing
plain text vs. a custom format. I have to admit I've been back and
forth a couple of times on this and have come to like a BSON-like
format for the data at rest.
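
To make the event-driven approach concrete, here is a rough sketch of
what such a visitor interface could look like in C. This is
illustrative only; the names are mine, not the actual pgjson API:

    #include <stddef.h>

    /* SAX-style event sink: a text parser or a BSON reader fires
     * these callbacks, and a text serializer or BSON writer consumes
     * them on the other side. */
    typedef struct JsonEvents
    {
        void (*start_object)(void *ctx);
        void (*end_object)(void *ctx);
        void (*start_array)(void *ctx);
        void (*end_array)(void *ctx);
        void (*field_name)(void *ctx, const char *name, size_t len);
        void (*string_value)(void *ctx, const char *s, size_t len);
        void (*number_value)(void *ctx, double d);
        void (*bool_value)(void *ctx, int b);
        void (*null_value)(void *ctx);
    } JsonEvents;

Converting between text and the binary form is then just a matter of
wiring a producer that emits these events into a consumer that writes
the other format, in either direction.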

Pros:
- It is directly iterable without parsing and/or constructing an AST (a
sketch of this follows the cons list below).
- It is its own representation. If, while iterating, you want to tear
off a value to be returned or used elsewhere, it's a simple buffer copy
plus some bit twiddling.
- It is conceivable that clients already know how to deal with BSON,
allowing them to work with the internal form directly (à la MongoDB).
- It stores a wider range of primitive types than JSON text. The most
important are Date and binary.

Cons:
- The format appears to have been "grown". Some of the critical
decisions were made late in the game (e.g., why would the integral type
codes be last?).
- Natively, the format falls victim to the artificial document vs.
element distinction, which I never understood. I've worked around this
with an escape mechanism for representing root values, but it's not
great.
- The processor is not resilient in the face of unknown element types.
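
To make "directly iterable" concrete, here is what walking a document
in place looks like, assuming the standard BSON layout from
bsonspec.org. This is an illustrative sketch, not pgjson's actual
code, and it only handles a few type codes:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* BSON is little-endian; assume a little-endian host here. */
    static int32_t
    read_i32(const uint8_t *p)
    {
        int32_t v;

        memcpy(&v, p, sizeof v);
        return v;
    }

    /* Size of the value starting at p, or -1 for an unknown type
     * code -- which is exactly the resilience problem noted above. */
    static int
    value_size(uint8_t type, const uint8_t *p)
    {
        switch (type)
        {
            case 0x01: return 8;               /* double */
            case 0x02: return 4 + read_i32(p); /* string: len + bytes */
            case 0x03:                         /* embedded document */
            case 0x04: return read_i32(p);     /* array */
            case 0x08: return 1;               /* boolean */
            case 0x0A: return 0;               /* null */
            case 0x10: return 4;               /* int32 */
            default:   return -1;              /* cannot skip unknowns */
        }
    }

    static void
    walk(const uint8_t *doc)
    {
        const uint8_t *p = doc + 4;            /* skip total length */

        while (*p != 0x00)                     /* 0x00 ends the doc */
        {
            uint8_t     type = *p++;
            const char *name = (const char *) p;
            int         sz;

            p += strlen(name) + 1;             /* skip cstring name */
            sz = value_size(type, p);
            if (sz < 0)
                return;                        /* unknown type: bail */
            printf("'%s': type 0x%02x, %d bytes\n", name, type, sz);
            /* Tearing off this value is just memcpy(dst, p, sz)
             * plus whatever header the destination requires. */
            p += sz;
        }
    }

    int
    main(void)
    {
        /* {"n": 1} encoded by hand: 12 bytes total. */
        uint8_t doc[] = {0x0C, 0, 0, 0,        /* total length */
                         0x10, 'n', 0,         /* int32 element "n" */
                         0x01, 0, 0, 0,        /* value 1 */
                         0x00};                /* terminator */

        walk(doc);
        return 0;
    }

No AST is built anywhere: skipping a value only requires knowing its
size, and extracting one is a bounded buffer copy.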

I'm leaning towards thinking that the determination comes down to the
following:
- If you just want a "checkbox" item that the database has a json
datatype and some support functions, storing as text may make sense. It
can be much simpler; however, it becomes increasingly hard to do
anything real without adding a parse-to-AST, manipulate, dump-to-text
cycle to every function.
- If you want a json datatype that is highly integrated and
manipulable, you want a binary data structure, and in the absence of
any other contender in this area, BSON is OK (not great, but OK).
- The addition of a JavaScript procedural language probably does not
bring its own format for data at rest. None of the engines I know of (I
haven't looked at what Guile is doing) has a static representation for
internal data structures; they are heap objects with liberal use of
internal and external pointers. Most do have a mechanism, however, for
injecting foreign objects into the runtime without making a dumb copy.
As such, the integration approach would probably be to determine the
best format for JSON data at rest and provide adapters that let the
chosen JavaScript runtime manipulate that at-rest format directly,
potentially using a copy-on-write approach (a rough sketch of such a
handle follows this list). If the at-rest format is text, then you
would need a parse-to-AST step for every JavaScript function
invocation.
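
A hypothetical shape for such an adapter (the names here are invented
for illustration, not taken from any engine's real API) would be a
copy-on-write handle over the at-rest buffer:

    #include <stddef.h>
    #include <stdint.h>

    /* Handle a JS runtime adapter could hold: reads go straight to
     * the at-rest BSON datum; the first write clones the value into
     * a private, mutable buffer. */
    typedef struct JsonCowHandle
    {
        const uint8_t *at_rest;      /* read-only view of the datum */
        size_t         at_rest_len;
        uint8_t       *mutated;      /* NULL until the first write */
        size_t         mutated_len;
    } JsonCowHandle;

Reads then cost nothing beyond pointer traversal into the stored
buffer; only mutation forces a copy.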

Here are a few notes on my current implementation:
- Excessive use of lex/yacc: This was quick and easy but the grammars are
simple enough that I'd probably hand-code a parser for any final solution.
- When the choice between following the json.org spec to the letter and
implementing lenient parsing for valid JavaScript constructs arose, I chose
lenient.
- Too much buffer copying: When I started, I was just doodling with
writing C code to manipulate JSON/BSON and not working with postgres in
particular. As such, it all uses straight malloc/free and too many copies
are made to get things in and out of VARDATA structures. This would all be
eliminated in any real version.
- UTF-8 is supported but not completely working. The support functions
that Joey wrote do a better job at this.
- My json path evaluation is crippled. Given the integration with the
PG type system, I thought I just wanted a simple property traversal
mechanism, punting higher-level manipulation to native PG functions (a
sketch of the traversal idea follows this list). Seeing real JSONPath
at work, though, I'm not so sure. I like the simplicity of what I've
done, but the full JSONPath feature set is nice too.
- This is first-pass prototype code with the goal of seeing it all
working together.
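
For what it's worth, the "simple property traversal" mechanism
mentioned above amounts to little more than a lookup loop over the
same skip logic sketched earlier. Again, this is illustrative and
reuses the hypothetical value_size() helper (and includes) from the
earlier sketch, not the actual pgjson code:

    /* Find the value for 'key' in the document at 'doc'.  Returns a
     * pointer to the value bytes and sets *type_out, or returns NULL
     * if the key is absent or an unknown type makes skipping
     * impossible. */
    static const uint8_t *
    find_property(const uint8_t *doc, const char *key, uint8_t *type_out)
    {
        const uint8_t *p = doc + 4;            /* skip total length */

        while (*p != 0x00)
        {
            uint8_t     type = *p++;
            const char *name = (const char *) p;
            int         sz;

            p += strlen(name) + 1;
            sz = value_size(type, p);
            if (sz < 0)
                return NULL;                   /* unknown type: bail */
            if (strcmp(name, key) == 0)
            {
                *type_out = type;
                return p;
            }
            p += sz;
        }
        return NULL;
    }

A dotted path like "a.b.c" is then just a chain of these lookups,
recursing into embedded documents; wildcards, filters, and the rest of
full JSONPath are where the real complexity comes in.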

While I had an end in mind, I did a lot of this for the fun of it and
just to scratch an itch, so I'm not really advocating for anything at
this point. I'm curious what others think the state of JSON in Postgres
should be. I've worked with JavaScript engines a good deal and would be
happy to help get us there, either using some of the work/approaches
here or going in a different direction.

Terry
