Re: Reducing output size of nodeToString

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Subject: Re: Reducing output size of nodeToString
Date: 2023-12-07 10:26:10
Message-ID: ff666461-bbcf-4bbf-a3ac-262785004377@eisentraut.org
Lists: pgsql-hackers

On 06.12.23 22:08, Matthias van de Meent wrote:
> PFA a patch that reduces the output size of nodeToString by 50%+ in
> most cases (measured on pg_rewrite), which on my system reduces the
> total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> pg_node_tree format alive, but reduces its size significantly.
>
> The basic techniques used are
> - Don't emit scalar fields when they contain a default value, and
> make the reading code aware of this.
> - Reasonable defaults are set for most datatypes, and overrides can
> be added with new pg_node_attr() attributes. No introspection into
> non-null Node/Array/etc. is being done though.
> - Reset more fields to their default values before storing the values.
> - Don't write trailing 0s in outDatum calls for by-ref types. This
> saves many bytes for Name fields, but also some other pre-existing
> entry points.
>
> Future work will probably have to be on a significantly different
> storage format, as the textual format is about to hit its entropy
> limits.
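
For illustration only (this is not code from the patch): the first
technique quoted above, skipping scalar fields that hold a default and
making the reader aware of it, could be sketched in C roughly as below.
ToyNode, toy_out(), toy_read() and the chosen default values are all
hypothetical; the real outfuncs/readfuncs code is generated and works
differently.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy node; not a real PostgreSQL node type. */
typedef struct ToyNode
{
    int kind;      /* assumed default: 0 */
    int location;  /* assumed default: -1 */
    int flags;     /* assumed default: 0 */
} ToyNode;

/* Append " :name value" to buf, unless value equals the assumed default. */
static size_t
write_int_opt(char *buf, size_t len, size_t cap,
              const char *name, int value, int dflt)
{
    int n;

    if (value == dflt)
        return len;             /* omitted: the reader restores the default */
    n = snprintf(buf + len, cap - len, " :%s %d", name, value);
    assert(n >= 0 && (size_t) n < cap - len);
    return len + (size_t) n;
}

/* Writer: emits only the fields that differ from their defaults. */
size_t
toy_out(const ToyNode *node, char *buf, size_t cap)
{
    size_t len;
    int n = snprintf(buf, cap, "{TOY");

    assert(n > 0 && (size_t) n < cap);
    len = (size_t) n;
    len = write_int_opt(buf, len, cap, "kind", node->kind, 0);
    len = write_int_opt(buf, len, cap, "location", node->location, -1);
    len = write_int_opt(buf, len, cap, "flags", node->flags, 0);
    assert(len + 1 < cap);
    buf[len++] = '}';
    buf[len] = '\0';
    return len;
}

/* Overwrite *field only if " :name " appears in the serialized text. */
static void
read_int_opt(const char *str, const char *name, int *field)
{
    char key[32];
    const char *p;

    snprintf(key, sizeof key, " :%s ", name);
    p = strstr(str, key);
    if (p != NULL)
        *field = atoi(p + strlen(key));
}

/* Reader: start from the same defaults, then apply the fields present. */
void
toy_read(const char *str, ToyNode *node)
{
    node->kind = 0;
    node->location = -1;
    node->flags = 0;
    read_int_opt(str, "kind", &node->kind);
    read_int_opt(str, "location", &node->location);
    read_int_opt(str, "flags", &node->flags);
}
```

In this sketch the writer and the reader must agree on every default,
and unlike a reader that ignores field names, this one depends on them
to know which fields were omitted.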

One thing that was mentioned repeatedly is that we might want different
formats for human consumption and for machine storage.

For human consumption, I would like some format like what you propose,
because it generally omits the "unset" or "uninteresting" fields.

But since you also talk about the size of pg_rewrite, I wonder whether
it would be smaller if we didn't write the field names at all and wrote
only the field values instead. (This should be pretty easy to test,
since the read functions currently ignore the field names anyway; you
could just write out all field names as "x" and see what happens.)
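
To make the proposed experiment concrete, here is a rough standalone
sketch (not PostgreSQL code) that serializes the same values three ways:
with full field names, with every name replaced by "x", and with no
names at all. The toy_write() function, the NameMode enum and the field
names are hypothetical stand-ins for what the real write functions do.

```c
#include <stdio.h>
#include <string.h>

#define TOY_NFIELDS 3

/* How field names are emitted in this sketch. */
typedef enum { NAMES_FULL, NAMES_X, NAMES_NONE } NameMode;

/* Hypothetical field names, in a fixed order known to writer and reader. */
static const char *const toy_field_names[TOY_NFIELDS] = {
    "kind", "location", "flags"
};

/*
 * Serialize values into buf using the given naming mode.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or 0 if the buffer is too small.
 */
size_t
toy_write(char *buf, size_t cap, NameMode mode, const int values[TOY_NFIELDS])
{
    size_t len;
    int n = snprintf(buf, cap, "{TOY");

    if (n < 0 || (size_t) n >= cap)
        return 0;
    len = (size_t) n;

    for (int i = 0; i < TOY_NFIELDS; i++)
    {
        if (mode == NAMES_FULL)
            n = snprintf(buf + len, cap - len, " :%s %d",
                         toy_field_names[i], values[i]);
        else if (mode == NAMES_X)
            n = snprintf(buf + len, cap - len, " :x %d", values[i]);
        else
            n = snprintf(buf + len, cap - len, " %d", values[i]);

        if (n < 0 || (size_t) n >= cap - len)
            return 0;
        len += (size_t) n;
    }

    n = snprintf(buf + len, cap - len, "}");
    if (n < 0 || (size_t) n >= cap - len)
        return 0;
    return len + (size_t) n;
}
```

Calling toy_write() once per mode on the same values and comparing the
returned lengths gives a feel for how much of the output is field names.
The "x" variant keeps a name token in every position, so a reader that
skips names without checking them could still parse it; the no-names
variant would need a matching reader change.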

I don't much like the way your patch uses the term "default". Most of
these default values are not defaults at all, but perhaps "most common
values". In theory, I would expect a default value to be initialized by
makeNode(). (That could be an interesting feature, but let's stay
focused here.) But even then most of these "defaults" wouldn't be
appropriate for a real default value. This part seems quite
controversial to me, and I would like to see some more details about how
much this specifically really saves.

I don't quite understand why in your patch you have some fields as
optional and some not. Or is that what WRITE_NODE_FIELD() vs.
WRITE_NODE_FIELD_OPT() means? How is it decided which one to use?

The part that clears out the location fields in pg_rewrite entries might
be worth considering as a separate patch. Could you explain it more?
Does it affect location pointers when using views at all?
