Re: problems with making relfilenodes 56-bits

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: problems with making relfilenodes 56-bits
Date: 2022-09-29 16:24:12
Message-ID: CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, 29 Sep 2022, 00:06 Robert Haas, <robertmhaas(at)gmail(dot)com> wrote:
>
> 2. WAL Size. Block references in the WAL are by RelFileLocator, so if
> you make RelFileLocators bigger, WAL gets bigger. We'd have to test
> the exact impact of this, but it seems a bit scary: if you have a WAL
> stream with few FPIs doing DML on a narrow table, probably most
> records will contain 1 block reference (and occasionally more, but I
> guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
> block reference feels like it might add up to something significant. I
> don't really see any way around this, either: if you make relfilenode
> values wider, they take up more space. Perhaps there's a way to claw
> that back elsewhere, or we could do something really crazy like switch
> to variable-width representations of integer quantities in WAL
> records, but there doesn't seem to be any simple way forward other
> than, you know, deciding that we're willing to pay the cost of the
> additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0] and
the interest of Neon in the area, and I think we can reduce the size
of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but the considering that there will
be a lot more conditionals in header decoding it might be slower for
any one backend, but faster overall (less overall IOps)

The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0], now in record header because XID is
included there as well, and both are required for consistent
snapshots.
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1], non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.

A normal record header (XLOG record with at least some registered
data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and
one with XID only up to 21 bytes. So, when compared to the current
XLogRecord format, we would in general recover 2 or 3 bytes from the
xl_tot_len field, 1 or 2 bytes from the alignment hole, and
potentially the 4 bytes of the xid when that data is considered
useless during recovery, or physical or logical replication.

Kind regards,

Matthias van de Meent

[0] https://postgr.es/m/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc%3DSdEHHsW6HS%2Bk9zw%40mail.gmail.com
[1] https://postgr.es/m/20220715173731.6t3km5cww3f5ztfq%40awork3.anarazel.de

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2022-09-29 16:40:02 Re: Should setting TupleTableSlotOps get_heap_tuple=NULL break INSERT..ON CONFLICT DO UPDATE..RETURNING?
Previous Message Nathan Bossart 2022-09-29 16:23:32 Re: identifying the backend that owns a temporary schema