Re: WIP Incremental JSON Parser

From: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP Incremental JSON Parser
Date: 2024-03-18 19:34:51
Message-ID: CAOYmi+n1fxCt3QGz9g=GfY1+J6__vVqsAud=HYwpUrfZM_VJHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 18, 2024 at 3:32 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> Not very easily. But I think and hope I've fixed the issue you've identified above about returning before lex->token_start is properly set.
>
> Attached is a new set of patches that does that and is updated for the json_errdetaiil() changes.

Thanks!

> ++ * Normally token_start would be ptok->data, but it could be later,
> ++ * see json_lex_string's handling of invalid escapes.
> + */
> -+ lex->token_start = ptok->data;
> ++ lex->token_start = dummy_lex.token_start;
> + lex->token_terminator = ptok->data + ptok->len;

By the same token (ha), the lex->token_terminator needs to be updated
from dummy_lex for some error paths. (IIUC, on success, the
token_terminator should always point to the end of the buffer. If it's
not possible to combine the two code paths, maybe it'd be good to
check that and assert/error out if we've incorrectly pulled additional
data into the partial token.)

With the incremental parser, I think prev_token_terminator is not
likely to be safe to use except in very specific circumstances, since
it could be pointing into a stale chunk. Some documentation around how
to use that safely in a semantic action would be good.

It looks like some of the newly added error handling paths cannot be
hit, because the production stack makes it logically impossible to get
there. (For example, if it takes a successfully lexed comma to
transition into JSON_PROD_MORE_ARRAY_ELEMENTS to begin with, then when
we pull that production's JSON_TOKEN_COMMA off the stack, we can't
somehow fail to match that same comma.) Assuming I haven't missed a
different way to get into that situation, could the "impossible" cases
have assert calls added?

I've attached two diffs. One is the group of tests I've been using
locally (called 002_inline.pl; I replaced the existing inline tests
with it), and the other is a set of potential fixes to get those tests
green.

Thanks,
--Jacob

Attachment Content-Type Size
wip-fixes.diff.txt text/plain 1.0 KB
002_inline.diff.txt text/plain 6.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Rahul Uniyal 2024-03-18 19:52:00 Re: Java : Postgres double precession issue with different data format text and binary
Previous Message Roberto Mello 2024-03-18 19:06:38 Re: documentation structure