Re: WIP Incremental JSON Parser

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Jacob Champion <jacob(dot)champion(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Smith <smithpb2250(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: WIP Incremental JSON Parser
Date: 2024-03-18 10:32:32
Message-ID: CAD5tBcLjFnTnAvUpQg-K5sJa3ECnEj8CQZ-AYrH1k+F=42D+Vg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 14, 2024 at 3:35 PM Jacob Champion <
jacob(dot)champion(at)enterprisedb(dot)com> wrote:

> I've been poking at the partial token logic. The json_errdetail() bug
> mentioned upthread (e.g. for an invalid input `[12zz]` and small chunk
> size) seems to be due to the disconnection between the "main" lex
> instance and the dummy_lex that's created from it. The dummy_lex
> contains all the information about the failed token, which is
> discarded upon an error return:
>
> > partial_result = json_lex(&dummy_lex);
> > if (partial_result != JSON_SUCCESS)
> > return partial_result;
>
> In these situations, there's an additional logical error:
> lex->token_start is pointing to a spot in the string after
> lex->token_terminator, which breaks an invariant that will mess up
> later pointer math. Nothing appears to be setting lex->token_start to
> point into the partial token buffer until _after_ the partial token is
> successfully lexed, which doesn't seem right -- in addition to the
> pointer math problems, if a previous chunk was freed (or on a stale
> stack frame), lex->token_start will still be pointing off into space.
> Similarly, wherever we set token_terminator, we need to know that
> token_start is pointing into the same buffer.
>
> Determining the end of a token is now done in two separate places
> between the partial- and full-lexer code paths, which is giving me a
> little heartburn. I'm concerned that those could drift apart, and if
> the two disagree on where to end a token, we could lose data into the
> partial token buffer in a way that would be really hard to debug. Is
> there a way to combine them?
>

Not very easily. But I think (and hope) I've fixed the issue you identified
above, where we could return before lex->token_start was properly set.
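The gist is to make sure the token pointers the caller sees always refer to
the buffer that was actually lexed. Very roughly, something along these lines
(just a sketch -- the details in the patch differ):

    partial_result = json_lex(&dummy_lex);

    /*
     * dummy_lex's token_start/token_terminator point into the partial-token
     * buffer, not the current chunk.  Copy them back before any early return,
     * so json_errdetail() and later pointer math operate on the buffer that
     * was actually lexed rather than on a stale chunk pointer.
     */
    lex->token_start = dummy_lex.token_start;
    lex->token_terminator = dummy_lex.token_terminator;

    if (partial_result != JSON_SUCCESS)
        return partial_result;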

Attached is a new set of patches that does that and is updated for the
json_errdetail() changes.

cheers

andrew

Attachment Content-Type Size
v11-0001-Introduce-a-non-recursive-JSON-parser.patch text/x-patch 424.9 KB
v11-0002-Add-support-for-incrementally-parsing-backup-man.patch text/x-patch 7.6 KB
v11-0003-Use-incremental-parsing-of-backup-manifests.patch text/x-patch 11.9 KB
