Re: Proposal to use JSON for Postgres Parser format

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Geoghegan <pg(at)bowt(dot)ie>, Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal to use JSON for Postgres Parser format
Date: 2022-09-20 11:37:13
Message-ID: CAEze2WhWoDv+nwSRyN_D6tNwnhxC5Uq0yyip=1vPEVPaqkANTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Peter Geoghegan <pg(at)bowt(dot)ie> writes:
> > > On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> Our existing format is certainly not great on those metrics, but
> > >> I do not see how "let's use JSON!" is a route to improvement.
> >
> > > The existing format was designed with developer convenience as a goal,
> > > though -- despite my complaints, and in spite of your objections.
> >
> > As Munro adduces nearby, it'd be a stretch to conclude that the current
> > format was designed with any Postgres-related goals in mind at all.
> > I think he's right that it's a variant of some Lisp-y dump format that's
> > probably far hoarier than even Berkeley Postgres.
> >
> > > If it didn't have to be easy (or even practical) for developers to
> > > directly work with the output format, then presumably the format used
> > > internally could be replaced with something lower level and faster. So
> > > it seems like the two goals (developer ergonomics and faster
> > > interchange format for users) might actually be complementary.
> >
> > I think the principal mistake in what we have now is that the storage
> > format is identical to the "developer friendly" text format (plus or
> > minus some whitespace). First we need to separate those. We could
> > have more than one equivalent text format perhaps, and I don't have
> > any strong objection to basing the text format (or one of them) on
> > JSON.
>
> +1 for considering storage format and text format separately.
>
> Let's consider what our criteria could be for the storage format.
>
> 1) Storage effectiveness (shorter is better) and
> serialization/deserialization effectiveness (faster is better). On
> this criterion, the custom binary format looks perfect.
> 2) Robustness in the case of corruption. It seems much easier to
> detect the data corruption and possibly make some partial manual
> recovery for textual format.
> 3) Standartness. It's better to use something known worldwide or at
> least used in other parts of PostgreSQL than something completely
> custom. From this perspective, JSON/JSONB is better than custom
> things.

Allow me to add: compressability

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

In that same thread, I also suggested that we could try to not emit a
Node's fields if they contain their default values while serializing;
such as the common `:location -1` or `:mynodefield <>`. Those fields
still take up space in the format, while conveying no interesting
information (the absense of that field in the struct definition would
convey the same). It would be useful if this new serialized format
would allow us to do similar tricks cheaply.

As for JSON vs JSONB for storage:
I'm fairly certain that JSONB is less compact than JSON (without
taking compression into the picture) due to the 4-byte guaranteed
overhead for each jsonb element; while for JSON that is only 2 bytes
for each (up to 3 when you consider separators, plus potential extra
overhead for escaped values that are unlikely to appear our catalogs).
Some numbers can be stored more efficiently in JSONB, but only large
numbers and small fractions that we're unlikely to hit in system
views: a back-of-the-envelope calculation puts the cutoff point of
efficient storage between strings-of-decimals and Numeric at >10^12, <
-10^11, or very precise fractional values.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2022-09-20 11:43:17 Re: binary version of pg_current_wal_insert_lsn and pg_walfile_name functions
Previous Message Florin Irion 2022-09-20 11:33:45 Re: pg_create_logical_replication_slot argument incongruency