Re: Proposal to use JSON for Postgres Parser format

From: Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal to use JSON for Postgres Parser format
Date: 2022-10-27 14:38:46
Message-ID: CACxu=v+EvGfn-3LLOfMLAuzf7P4-CvSWAJMvtvS8KYXhNMWPAg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent <
boekewurm+postgres(at)gmail(dot)com> wrote:

> On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
> wrote:
> >
> > On 2022-Sep-20, Matthias van de Meent wrote:
> >
> > > Allow me to add: compressability
> > >
> > > In the thread surrounding [0] there were complaints about the size of
> > > catalogs, and specifically the template database. Significant parts of
> > > that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
> > > consists mostly of serialized Nodes. If we're going to replace our
> > > current NodeToText infrastructure, we'd better know we can effectively
> > > compress this data.
> >
> > True. Currently, the largest ev_action values compress pretty well. I
> > think if we wanted this to be more succint, we would have to invent some
> > binary format -- perhaps something like Protocol Buffers: it'd be stored
> > in the binary format in catalogs, but for output it would be converted
> > into something easy to read (we already do this for
> > pg_statistic_ext_data for example). We'd probably lose compressibility,
> > but that'd be okay because the binary format would already remove most
> > of the redundancy by nature.
> >
> > Do we want to go there?
>
> I don't think that a binary format would be much better for
> debugging/fixing than an optimization of the current textual format
> when combined with compression.

I agree, JSON is not perfect, but it compresses and it's usable
everywhere. My personal need for this is purely developer experience, and
Tom pointed out, a "niche" need for sure, but we are starting to do some
serious work with Dan Lynch's plpgsql deparser tool to generate RLS
policies from meta schema models, and having the same format come out of
the parser would make a complete end to end solution for us, especially if
we can get this data from a function in a ddl_command_start event trigger.
Dan also writes a popular deparser for Javascript, and unifying the formats
across these tools would be a big win for us.

> As I mentioned in that thread, there
> is a lot of improvement possible with the existing format, and I think
> any debugging of serialized nodes would greatly benefit from using a
> textual format.
>

Agreed.

> Then again, I also agree that this argument doesn't hold it's weight
> when storage and output formats are going to be different. I trust
> that any new tooling introduced as a result of this thread will be
> better than what we have right now.
>

Separating formats seems like a lot of work to me, to get what might not be
a huge improvement over compressing JSON, for what seems unlikely to be
more than a few megabytes of parsed SQL.

> As for best format: I don't know. The current format is usable, and a
> better format would not store any data for default values. JSON can do
> that, but I could think of many formats that could do the same (Smile,
> BSON, xml, etc.).
>
> I do not think that protobuf is the best choice for storage, though,
> because it has its own rules on what it considers a default value and
> what it does or does not serialize: zero is considered the only
> default for numbers, as is the empty string for text, etc.
> I think it is allright for general use, but with e.g. `location: -1`
> in just about every parse node we'd probably want to select our own
> values to ignore during (de)serialization of fields.
>

Agreed.

Thank you everyone who has contributed to this thread, I'm pleased that it
got a very spirited debate and I apologize for the delay in getting back to
everyone.

I'd like to spike on a proposed patch that:

- Converts the existing text format to JSON (or possibly jsonb,
considering feedback from this thread)
- Can be stored compressed
- Can be passed to a ddl_command_start event trigger with a function.

Thoughts?

-Michel

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2022-10-27 15:08:01 Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE
Previous Message Zhang Mingli 2022-10-27 13:37:04 Re: Reducing duplicativeness of EquivalenceClass-derived clauses