| From: | Terry Laurenzo <tj(at)laurenzo(dot)org> | 
|---|---|
| To: | Robert Haas <robertmhaas(at)gmail(dot)com> | 
| Cc: | Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: patch: Add JSON datatype to PostgreSQL (GSoC, WIP) | 
| Date: | 2010-11-21 05:31:17 | 
| Message-ID: | AANLkTikL3dfjHsw1vpgHW7Cq=e6EUCDfoQXGYzUhpF+c@mail.gmail.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
I've got a new stripped down version of the binary json plugin on github:
https://github.com/tlaurenzo/pgjson
With all due warning of contrived benchmarks, I wrote some tests to see
where things stand.   The test script is here:
    https://github.com/tlaurenzo/pgjson/blob/master/testdocs/runbenchmark.sh
The results from my laptop (First gen Macbook Pro 32bit with Snow Leopard
and Postgresql 9.0) are here and copied into this email at the end:
https://github.com/tlaurenzo/pgjson/blob/master/testdocs/benchmark-results-2010-11-20-mbp32.txt
And for some commentary...
I copied the 5 sample documents from json.org's example section for these
tests.  These are loaded into a table with a varchar column 1000 times each
(so the test table has 5000 rows in it).  In all situations, the binary
encoding was smaller than the normalized text form (between 9 and 23%
smaller).  I think there are cases where the binary form will be larger than
the corresponding text form, but I don't think they would be very common.
For the timings, various operations are performed on the 5000 row test table
for 100 iterations.  In all situations, the query returns the Length of the
transformed document instead of the document itself so as to factor out
variable client IO between the test runs.  The Null Parse is essentially
just a select from the table and therefore represents the baseline.  The
times varied a little bit between runs but did not change materially.
What we see from this is that parsing JSON text and generating a binary
representation is cheap, representing approximately 10% of the base case
time.  Conversely, anything that involves generating JSON text is expensive,
accounting for 30-40% of the base case time.  Some incidental profiling
shows that while the entire operation is expensive, the process of
generating string literals dominates this time.  There is likely room for
optimization in this method, but it should be noted that most of these
documents are lightly escaped (if escaped at all) which represents the happy
path through the string literal output function.
While I have not profiled any advanced manipulation of the binary structure
within the server, it stands to reason that manipulating the binary
structure should be significantly faster than an approach that requires
(perhaps multiple) transcoding between text representations in order to
complete a sequence of operations.
Assuming that the JSON datatype (at a minimum) normalizes text for storage,
then the text storage option accounts for about the most expensive path but
with none of the benefits of an internal binary form (smaller size, ability
to cheaply perform non-trivial manipulation within the database server).
Of course, just having a JSON datatype that blindly stores text will beat
everything, but I'm getting closer to thinking that the binary option is
worth the tradeoff.
Comments?
Terry
Running benchmark with 100 iterations per step
Loading data...
Data loaded.
=== DOCUMENT SIZE STATISTICS ===
    Document Name     | Original Size | Binary Size | Normalized Size |
Percentage Savings
----------------------+---------------+-------------+-----------------+--------------------
 jsonorg_sample1.json |           582 |         311 |             360 |
13.6111111111111
 jsonorg_sample2.json |           241 |         146 |             183 |
20.2185792349727
 jsonorg_sample3.json |           601 |         326 |             389 |
16.1953727506427
 jsonorg_sample4.json |          3467 |        2466 |            2710 |
9.00369003690037
 jsonorg_sample5.json |           872 |         469 |             613 |
23.4910277324633
(5 rows)
=== TEST PARSE AND SERIALIZATION ===
Null Parse:
       12.12 real         2.51 user         0.13 sys
Parse to Binary:
       13.38 real         2.51 user         0.14 sys
Serialize from Binary:
       16.65 real         2.51 user         0.13 sys
Normalize Text:
       18.99 real         2.51 user         0.13 sys
Roundtrip (parse to binary and serialize):
       18.58 real         2.51 user         0.14 sys
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Pavel Stehule | 2010-11-21 06:09:31 | Re: Fwd: patch: format function - fixed oid | 
| Previous Message | Vaibhav Kaushal | 2010-11-21 05:06:10 | Re: Fwd: What do these terms mean in the SOURCE CODE? |