Re: WIP Incremental JSON Parser

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP Incremental JSON Parser
Date: 2024-04-01 23:52:59
Message-ID: 7ead3d4d-ddc6-2499-0a04-6b29c693ea4a@dunslane.net
Lists: pgsql-hackers


On 2024-03-29 Fr 17:21, Jacob Champion wrote:
> On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to add semantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing.
> Thanks! 0004 did highlight one last bug for me -- the return value
> from semantic callbacks is ignored, so applications can't interrupt
> the parsing early:
>
>> + if (ostart != NULL)
>> + (*ostart) (sem->semstate);
>> + inc_lex_level(lex);
> Anything not JSON_SUCCESS should be returned immediately, I think, for
> this and the other calls.

Fixed

>
>> + /* make sure we've used all the input */
>> + if (lex->token_terminator - lex->token_start != ptok->len)
>> + return JSON_INVALID_TOKEN;
> nit: Debugging this, if anyone ever hits it, is going to be confusing.
> The error message is going to claim that a valid token is invalid, as
> opposed to saying that the parser has a bug. Short of introducing
> something like JSON_INTERNAL_ERROR, maybe an Assert() should be added
> at least?

Done

>
>> + case JSON_NESTING_TOO_DEEP:
>> + return(_("Json nested too deep, maximum permitted depth is 6400"));
> nit: other error messages use all-caps, "JSON"

Fixed

>
>> + char chunk_start[100],
>> + chunk_end[100];
>> +
>> + snprintf(chunk_start, 100, "%s", ib->buf.data);
>> + snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99)));
>> + elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end);
> Is this NOTICE intended to be part of the final commit?

No, removed.

>
>> + inc_state = json_parse_manifest_incremental_init(&context);
>> +
>> + buffer = pg_malloc(chunk_size + 64);
> These `+ 64`s (there are two of them) seem to be left over from
> debugging. In v7 they were just `+ 1`.

Fixed

I've also added the Asserts to the error handling code you suggested.
This tests fine with the regression suite under FORCE_JSON_PSTACK.

>
> --
>
> From this point onward, I think all of my feedback is around
> maintenance characteristics, which is firmly in YMMV territory, and I
> don't intend to hold up a v1 of the feature for it since I'm not the
> maintainer. :D
>
> The complexity around the checksum handling and "final chunk"-ing is
> unfortunate, but not a fault of this patch -- just a weird consequence
> of the layering violation in the format, where the checksum is inside
> the object being checksummed. I don't like it, but I don't think
> there's much to be done about it.

Yeah. I agree it's ugly, but I don't think we should let it hold us up
at all. Working around it doesn't involve too much extra code.

>
> By far the trickiest part of the implementation is the partial token
> logic, because of all the new ways there are to handle different types
> of failures. I think any changes to the incremental handling in
> json_lex() are going to need intense scrutiny from someone who already
> has the mental model loaded up. I'm going snowblind on the patch and
> I'm no longer the right person to review how hard it is to get up to
> speed with the architecture, but I'd say it's not easy.

json_lex() is not really a very hot piece of code. I have added a
comment at the top giving a brief description of the partial token
handling. Even without the partial token bit it can be a bit scary, but
I don't think the partial token piece is so scary that it should stop
us from proceeding.

>
> For something as security-critical as a JSON parser, I'd usually want
> to counteract that by sinking the observable behaviors in concrete and
> getting both the effective test coverage *and* the code coverage as
> close to 100% as we can stand. (By that, I mean that purposely
> introducing observable bugs into the parser should result in test
> failures. We're not there yet when it comes to the semantic callbacks,
> at least.) But I don't think the current JSON parser is held to that
> standard currently, so it seems strange for me to ask for that here.

Yeah.

>
> I think it'd be good for a v1.x of this feature to focus on
> simplification of the code, and hopefully consolidate and/or eliminate
> some of the duplicated parsing work so that the mental model isn't
> quite so big.
>

I'm not sure how you think that can be done. Which parts of the
processing are you having difficulty coming to grips with?

Anyway, here are new patches. I've rolled the new semantic test into the
first patch.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachment Content-Type Size
v15-0001-Introduce-a-non-recursive-JSON-parser.patch.gz application/gzip 125.6 KB
v15-0002-Add-support-for-incrementally-parsing-backup-man.patch.gz application/gzip 2.3 KB
v15-0003-Use-incremental-parsing-of-backup-manifests.patch.gz application/gzip 3.1 KB

