Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Terry Laurenzo <tj(at)laurenzo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date: 2010-10-19 21:39:53
Message-ID: AANLkTi=Lc2wOp1XMguj_OMQoWYTW5V507gbtirZur-xe@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 19, 2010 at 2:46 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Oct 19, 2010 at 3:40 PM, Joseph Adams
> <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> > On Tue, Oct 19, 2010 at 3:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> >> I think we should take a few steps back and ask why we think that
> >> binary encoding is the way to go. We store XML as text, for example,
> >> and I can't remember any complaints about that on -bugs or
> >> -performance, so why do we think JSON will be different? Binary
> >> encoding is a trade-off. A well-designed binary encoding should make
> >> it quicker to extract a small chunk of a large JSON object and return
> >> it; however, it will also make it slower to return the whole object
> >> (because you're adding serialization overhead). I haven't seen any
> >> analysis of which of those use cases is more important and why.
> >
> > Speculation: the overhead involved with retrieving/sending and
> > receiving/storing JSON (not to mention TOAST
> > compression/decompression) will be far greater than that of
> > serializing/unserializing.
>
> I speculate that your speculation is incorrect. AIUI, we, unlike
> $COMPETITOR, tend to be CPU-bound rather than IO-bound on COPY. But
> perhaps less speculation and more benchmarking is in order.
>
>
After spending a week in the morass of this, I have to say that I am less
certain than I was on any front regarding the text/binary distinction. I'll
take some time and benchmark the different cases. My hypothesis is that a
well-implemented binary structure and conversions will add minimal overhead
in the IO + Validate case, which would be the typical in/out flow. It could
be substantially faster for binary send/receive because the validation step
could be eliminated or reduced. Further, storing as binary reduces the
overhead of random access to the data by database functions.

I'm envisioning staging this up as follows:
1. Create a "jsontext" type. jsontext uses text as its internal
representation; its in/out functions are essentially a straight copy or a
copy + validate (sketched after this list).
2. Create a "jsonbinary" type. This uses an optimized binary format for its
internal representation and for send/receive; in/out is a parse/transcode
operation to and from standard JSON text.
3. Internal data access functions and JSON Path require a jsonbinary.
4. There are implicit casts between jsontext and jsonbinary.
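
To illustrate step 1, the jsontext input function would be little more than
validate-and-copy. Something like this, where json_validate() is a
hypothetical syntax checker and not something that exists in the backend
today:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

/* hypothetical helper: checks JSON syntax without building a parse tree */
extern bool json_validate(const char *str);

PG_FUNCTION_INFO_V1(jsontext_in);

/*
 * Sketch of a copy + validate input function for jsontext.  The stored
 * form is just the original text, so output is a straight copy back.
 */
Datum
jsontext_in(PG_FUNCTION_ARGS)
{
    char   *str = PG_GETARG_CSTRING(0);

    if (!json_validate(str))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid input syntax for type jsontext")));

    /* keep the caller's text verbatim as the internal representation */
    PG_RETURN_TEXT_P(cstring_to_text(str));
}

The output function is the same copy in the other direction, with no
validation needed.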

I've got a grammar in mind for the binary structure that I'll share later
when I've got some more time. It's inspired by $COMPETITOR's format but a
little saner, using type tags that implicitly define the size of their
operands, which simplifies parsing.
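
To make the tag idea concrete (this is purely illustrative, not the grammar
I'll be proposing): each tag byte encodes both the node type and the width
of the length field that follows it, so the parser can size any operand
from the tag alone, without scanning ahead.

/* Illustrative only -- not the actual proposed grammar. */
typedef enum JsonBinTag
{
    JB_NULL     = 0x00,     /* no operand */
    JB_FALSE    = 0x01,     /* no operand */
    JB_TRUE     = 0x02,     /* no operand */
    JB_STRING8  = 0x10,     /* 1-byte length, then UTF-8 bytes */
    JB_STRING32 = 0x11,     /* 4-byte length, then UTF-8 bytes */
    JB_NUMBER8  = 0x20,     /* 1-byte length, then numeric literal text */
    JB_ARRAY32  = 0x30,     /* 4-byte total byte size, then elements */
    JB_OBJECT32 = 0x40      /* 4-byte total byte size, then key/value pairs */
} JsonBinTag;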

I'll then define the various use cases and benchmark them using the
different types. Some examples include IO No Validate, IO + Validate, Store
and Index, Internal Processing, Internal Composition, etc.

The answer may be to have both a jsontext and a jsonbinary type, as each
will be optimized for a different case.

Make sense? It may be a week before I get through this.
Terry
