Re: Reducing output size of nodeToString

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Michel Pelletier <pelletier(dot)michel(at)gmail(dot)com>
Subject: Re: Reducing output size of nodeToString
Date: 2024-01-03 23:23:50
Message-ID: CAEze2Wigkd1+J4s=7wUqW8Y4g9mDWSC28119ukbKkf799WBpzg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
> >
> > The basic techniques used are
> > - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> > - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> > - Reset more fields to their default values before storing the values.
> > - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.

I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.

In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive, while reducing the size
significantly (nearing 2/3 reduction), taking your comments into
account. I think the gains are worth the consideration without taking
into account the as-of-yet unimplemented binary format.
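
To illustrate the core "omit default values" idea in isolation, here is a minimal sketch in Python. It is not the patch's code (that is C, generated from pg_node_attr() annotations), and the field names and defaults below are hypothetical, chosen only to show how the writer and reader have to agree on the same defaults:

```python
# Hypothetical defaults; illustrative only, not the real pg_node_tree fields.
DEFAULTS = {"typid": 0, "typmod": -1, "collid": 0, "location": -1}

def write_node(fields):
    """Serialize only the fields whose value differs from the default."""
    parts = []
    for name, value in fields.items():
        if DEFAULTS.get(name) == value:
            continue  # omitted; the reader restores the default
        parts.append(f":{name} {value}")
    return "{" + " ".join(parts) + "}"

def read_node(text):
    """Parse write_node() output, filling in any omitted defaults."""
    fields = dict(DEFAULTS)
    tokens = text.strip("{}").split()
    for key, raw in zip(tokens[0::2], tokens[1::2]):
        fields[key[1:]] = int(raw)  # drop the leading ':' from the key
    return fields

node = {"typid": 23, "typmod": -1, "collid": 0, "location": -1}
text = write_node(node)
print(text)
print(read_node(text) == node)
```

The round trip only works because the reader starts from the same default table the writer used to decide what to skip.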

> My suggestion is to make incremental patches along these lines:
> [...]

Something like the attached? It splits out into the following:
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising amount of bytes.
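
For readers unfamiliar with the 0007 technique, the following Python sketch shows one way run-length plus gap (differential) coding can shrink an integer set. The exact encoding used in the patch is not described in this message, so the representation below (a list of (gap, count) pairs) is an assumption made purely for illustration:

```python
from itertools import accumulate, groupby

def encode(members):
    """Gap-code a list of ints, then run-length encode the gaps.

    Each gap is the difference from the previous member (the first is
    taken relative to 0). Returns a list of (gap, repeat_count) pairs.
    """
    gaps = [b - a for a, b in zip([0] + members[:-1], members)]
    return [(gap, len(list(run))) for gap, run in groupby(gaps)]

def decode(runs):
    """Invert encode(): expand the runs and take the running sum."""
    gaps = [gap for gap, count in runs for _ in range(count)]
    return list(accumulate(gaps))

members = [1, 2, 3, 4, 5, 9, 10, 11]
runs = encode(members)
print(runs)
print(decode(runs) == members)
```

Dense runs of consecutive values, which are common in bitmapsets of attribute numbers, collapse into a single pair, which is where the savings come from.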

> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me. By splitting it up we can consider these
> incrementally.

That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.

select pg_database_size('template0') as "template0"
, pg_total_relation_size('pg_rewrite') as "pg_rewrite"
, sum(pg_column_size(ev_action)) as "compressed"
, sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | compressed |   raw
---------+-----------+------------+------------+---------
 master  |   7545359 |     761856 |     573307 | 2998712
 0001    |   7365135 |     622592 |     438224 | 1943772
 0002    |   7258639 |     573440 |     401660 | 1835803
 0003    |   7258639 |     565248 |     386211 | 1672539
 0004    |   7176719 |     483328 |     317099 | 1316552
 0005    |   7176719 |     483328 |     315556 | 1300420
 0006    |   7160335 |     466944 |     302806 | 1208621
 0007    |   7143951 |     450560 |     287659 | 1187237
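
The percentages quoted above can be rechecked from the table. The short Python snippet below does so, assuming the 0007 row reflects the full patchset applied on top of master; the byte counts are copied from the table and nothing else is measured:

```python
# Byte counts copied from the table in this message.
master = {"template0": 7545359, "pg_rewrite": 761856,
          "compressed": 573307, "raw": 2998712}
v0007 = {"template0": 7143951, "pg_rewrite": 450560,
         "compressed": 287659, "raw": 1187237}

def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

reductions = {name: reduction_pct(master[name], v0007[name]) for name in master}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% smaller")
```

This gives roughly 41% for pg_rewrite, 5% for template0 and 60% for the raw ev_action data, in line with the figures stated above.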

While looking through the data, I noticed that the larger views now
consist in significant part of range table entries, specifically the
Alias and Var nodes (which hold mostly repeated and/or repetitive
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.
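
As a rough illustration of why column-major storage could help with such repetitive nodes, the Python sketch below compares a row-major and a column-major text encoding of some Var-like records. The field names, the sample values and the "value*count" notation are all hypothetical; nothing like this exists in the attached patches:

```python
from itertools import groupby

# Hypothetical Var-like records: only the attribute number varies.
rows = [{"varno": 1, "varattno": i, "vartype": 23, "varlevelsup": 0}
        for i in range(1, 9)]

def row_major(rows):
    """One {...} group per node, repeating every field name each time."""
    return " ".join(
        "{" + " ".join(f":{k} {v}" for k, v in r.items()) + "}" for r in rows
    )

def column_major(rows):
    """One entry per field; each column is run-length coded as value*count."""
    cols = []
    for name in rows[0]:
        runs = [(v, len(list(g))) for v, g in groupby(r[name] for r in rows)]
        cols.append(f":{name} " + ",".join(f"{v}*{n}" for v, n in runs))
    return "{" + " ".join(cols) + "}"

print(len(row_major(rows)), len(column_major(rows)))
```

Grouping values by field puts the repeated values next to each other, so they compress well even with something as simple as run-length coding, at the cost of a reader that has to reassemble the nodes.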

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachment                                                       | Content-Type             | Size
v1-0001-pg_node_tree-Don-t-serialize-fields-with-type-def.patch | application/octet-stream | 22.8 KB
v1-0002-pg_node_tree-reset-node-location-before-catalog-s.patch | application/octet-stream | 12.9 KB
v1-0005-NodeSupport-Don-t-emit-trailing-0s-in-outDatum.patch    | application/octet-stream | 2.4 KB
v1-0004-NodeSupport-add-some-more-default-markers-for-var.patch | application/octet-stream | 4.6 KB
v1-0003-Nodesupport-add-support-for-custom-default-values.patch | application/octet-stream | 13.2 KB
v1-0007-NodeSupport-Apply-RLE-and-differential-encoding-o.patch | application/octet-stream | 6.5 KB
v1-0006-NodeSupport-Apply-some-more-defaults-serializatio.patch | application/octet-stream | 16.5 KB
