Re: Reducing output size of nodeToString

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Subject: Re: Reducing output size of nodeToString
Date: 2023-12-07 11:18:28
Message-ID: CAEze2Wh1qvMBLrs4v5FEJHsYSnJAEAXxibN0g+oDLCt_EtghoQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 7 Dec 2023 at 11:26, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size signficantly.
> >
> > The basic techniques used are
> > - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> > - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> > - Reset more fields to their default values before storing the values.
> > - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
> >
> > Future work will probably have to be on a significantly different
> > storage format, as the textual format is about to hit its entropy
> > limits.
>
> One thing that was mentioned repeatedly is that we might want different
> formats for human consumption and for machine storage.
> For human consumption, I would like some format like what you propose,
> because it generally omits the "unset" or "uninteresting" fields.
>
> But since you also talk about the size of pg_rewrite, I wonder whether
> it would be smaller if we just didn't write the field names at all but
> instead all the field values. (This should be pretty easy to test,
> since the read functions currently ignore the field names anyway; you
> could just write out all field names as "x" and see what happens.)

I've been thinking about using a more binary storage format similar to
protobuf (but with system knowledge baked in, instead of PB's
defaults), but that would be dependent on functions that change the
output functions of pg_node_tree too, which Michel mentioned he would
work on a year ago (iiuc).

I think it would be a logical next step after this, but this patch is
just on building infrastructure that reduces the stored size without
getting in the way of Michel's work, if there was any result.

> I don't much like the way your patch uses the term "default". Most of
> these default values are not defaults at all, but perhaps "most common
> values".

Yes, some 'defaults' are curated, but they have sound logic behind
them: *typmod is essentially always copied from an attypmod, which
defaults to -1. *isnull for any constant is generally unset. Many of
those other fields (once initialized by the relevant code) default to
those values I used.

> In theory, I would expect a default value to be initialized by
> makeNode(). (That could be an interesting feature, but let's stay
> focused here.) But even then most of these "defaults" wouldn't be
> appropriate for a real default value. This part seems quite
> controversial to me, and I would like to see some more details about how
> much this specifically really saves.

The tuning of these "defaults" got the savings from 20-30% to this
50%+ reduction in raw size.

> I don't quite understand why in your patch you have some fields as
> optional and some not. Or is that what WRITE_NODE_FIELD() vs.
> WRITE_NODE_FIELD_OPT() means? How is it decided which one to use?

I use _OPT when I know the value is likely to be its defualt value,
and don't change over to _OPT when I know with great certainty the
value is going to be dynamic, such as relation ID in RTEs, but this is
only relevant for manual code as generated code essentially always
uses the _OPT paths.

> The part that clears out the location fields in pg_rewrite entries might
> be worth considering as a separate patch. Could you explain it more?
> Does it affect location pointers when using views at all?

Views don't store the original query string, so the location pointers
in views point to locations in a now non-existent query string.
Additionally, unless WRITE_READ_PARSE_PLAN_TREES is defined,
READ_LOCATION_FIELD does not actually read the stored value but
instead stores -1 in the indicated field, so in most cases there won't
be any difference between the deserialized data before and after this
part of the patch; the only difference is the amount of debugable
information stored in the view's internal data.
Note that resetting them to 'invalid' value thus makes sense, and
improves compressibility and allows removal from the serialized format
when serialization omits fields with default values.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bharath Rupireddy 2023-12-07 11:20:30 Re: Is WAL_DEBUG related code still relevant today?
Previous Message vignesh C 2023-12-07 11:14:36 Re: pg_upgrade and logical replication