Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

From: Terry Laurenzo <tj(at)laurenzo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP)
Date: 2010-11-21 05:31:17
Message-ID: AANLkTikL3dfjHsw1vpgHW7Cq=e6EUCDfoQXGYzUhpF+c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I've got a new stripped down version of the binary json plugin on github:
https://github.com/tlaurenzo/pgjson

With all due warning of contrived benchmarks, I wrote some tests to see
where things stand. The test script is here:
https://github.com/tlaurenzo/pgjson/blob/master/testdocs/runbenchmark.sh
The results from my laptop (First gen Macbook Pro 32bit with Snow Leopard
and Postgresql 9.0) are here and copied into this email at the end:

https://github.com/tlaurenzo/pgjson/blob/master/testdocs/benchmark-results-2010-11-20-mbp32.txt

And for some commentary...
I copied the 5 sample documents from json.org's example section for these
tests. These are loaded into a table with a varchar column 1000 times each
(so the test table has 5000 rows in it). In all situations, the binary
encoding was smaller than the normalized text form (between 9 and 23%
smaller). I think there are cases where the binary form will be larger than
the corresponding text form, but I don't think they would be very common.

For the timings, various operations are performed on the 5000 row test table
for 100 iterations. In all situations, the query returns the Length of the
transformed document instead of the document itself so as to factor out
variable client IO between the test runs. The Null Parse is essentially
just a select from the table and therefore represents the baseline. The
times varied a little bit between runs but did not change materially.

What we see from this is that parsing JSON text and generating a binary
representation is cheap, representing approximately 10% of the base case
time. Conversely, anything that involves generating JSON text is expensive,
accounting for 30-40% of the base case time. Some incidental profiling
shows that while the entire operation is expensive, the process of
generating string literals dominates this time. There is likely room for
optimization in this method, but it should be noted that most of these
documents are lightly escaped (if escaped at all) which represents the happy
path through the string literal output function.

While I have not profiled any advanced manipulation of the binary structure
within the server, it stands to reason that manipulating the binary
structure should be significantly faster than an approach that requires
(perhaps multiple) transcoding between text representations in order to
complete a sequence of operations.

Assuming that the JSON datatype (at a minimum) normalizes text for storage,
then the text storage option accounts for about the most expensive path but
with none of the benefits of an internal binary form (smaller size, ability
to cheaply perform non-trivial manipulation within the database server).

Of course, just having a JSON datatype that blindly stores text will beat
everything, but I'm getting closer to thinking that the binary option is
worth the tradeoff.

Comments?
Terry

Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
Document Name | Original Size | Binary Size | Normalized Size |
Percentage Savings
----------------------+---------------+-------------+-----------------+--------------------
jsonorg_sample1.json | 582 | 311 | 360 |
13.6111111111111
jsonorg_sample2.json | 241 | 146 | 183 |
20.2185792349727
jsonorg_sample3.json | 601 | 326 | 389 |
16.1953727506427
jsonorg_sample4.json | 3467 | 2466 | 2710 |
9.00369003690037
jsonorg_sample5.json | 872 | 469 | 613 |
23.4910277324633
(5 rows)

=== TEST PARSE AND SERIALIZATION ===
Null Parse:
12.12 real 2.51 user 0.13 sys
Parse to Binary:
13.38 real 2.51 user 0.14 sys
Serialize from Binary:
16.65 real 2.51 user 0.13 sys
Normalize Text:
18.99 real 2.51 user 0.13 sys
Roundtrip (parse to binary and serialize):
18.58 real 2.51 user 0.14 sys

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2010-11-21 06:09:31 Re: Fwd: patch: format function - fixed oid
Previous Message Vaibhav Kaushal 2010-11-21 05:06:10 Re: Fwd: What do these terms mean in the SOURCE CODE?