Re: problems with making relfilenodes 56-bits

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: problems with making relfilenodes 56-bits
Date: 2022-09-29 21:58:59
Message-ID: CA+Tgmoa7pNxxe_K=3mTHHZGSmnrc_YgApArx3OFHN2g57nzLNw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> Currently, our minimal WAL record is exactly 24 bytes: length (4B),
> TransactionId (4B), previous record pointer (8B), flags (1B), redo
> manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
> fields, TransactionId could reasonably be omitted for certain WAL
> records (for example, index insertions don't really need the XID).
> Additionally, the length field could be made to be variable length,
> and any padding is just plain bad (adding 4 bytes to all
> insert/update/delete/lock records was frowned upon).

Right. I was shocked when I realized that we had two bytes of padding
in there, considering that numerous rmgrs are stealing bits from the
1-byte field that identifies the record type. My question was: why
aren't we exposing those 2 bytes for rmgr-type-specific use? Or for
something like xl_xact_commit, we could get rid of xl_xact_info if we
had those 2 bytes to work with.
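For reference, the 24-byte layout described in the quoted text can be sketched as a standalone struct. This is only a stand-in that mirrors the fields listed above; the type and field names are made up for illustration and it is not the actual PostgreSQL header definition. The static assertions check the claimed size and show where the two padding bytes sit, assuming a typical platform with 4-byte uint32_t alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative stand-in for the fixed WAL record header discussed above.
 * Names are hypothetical; only the field sizes come from the thread.
 */
typedef struct SketchXLogRecord
{
    uint32_t tot_len;   /* total record length (4B) */
    uint32_t xid;       /* TransactionId (4B) */
    uint64_t prev;      /* pointer to previous record (8B) */
    uint8_t  info;      /* flag bits, partly used by rmgrs (1B) */
    uint8_t  rmid;      /* resource manager id (1B) */
    uint8_t  pad[2];    /* the two padding bytes in question */
    uint32_t crc;       /* CRC of the record (4B) */
} SketchXLogRecord;

/* 4 + 4 + 8 + 1 + 1 + 2 + 4 = 24 bytes */
static_assert(sizeof(SketchXLogRecord) == 24, "header should be 24 bytes");
/* The padding occupies offsets 18..19, right before the CRC. */
static_assert(offsetof(SketchXLogRecord, pad) == 18, "padding at offset 18");
static_assert(offsetof(SketchXLogRecord, crc) == 20, "CRC at offset 20");
```

Making the padding an explicit member, as here, is one way to picture what "exposing those 2 bytes" would mean: the bytes are already paid for in every record.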

Right now, I see that a bare commit record is 34 bytes, which rounds
up to 40. With the trick above, we could shave off 4 bytes, bringing
the size to 30, which would round up to 32. That's a pretty significant
savings, although it'd be a lot better if we could get some kind of
savings for DML records, which could be much higher frequency.
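The rounding arithmetic above can be checked with a small helper. This is a sketch that assumes records are padded to 8-byte boundaries; sketch_maxalign is a hypothetical name, not a PostgreSQL macro.

```c
#include <stddef.h>

/*
 * Round len up to the next multiple of 8, assuming an 8-byte
 * alignment requirement for WAL records (an assumption here).
 */
static size_t
sketch_maxalign(size_t len)
{
    return (len + 7) & ~(size_t) 7;
}
```

With this, sketch_maxalign(34) gives 40 and sketch_maxalign(30) gives 32, so trimming 4 bytes from the record saves 8 bytes of WAL once alignment is taken into account.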

> I'm working on a prototype patch for a more bare-bones WAL record
> header of which the only required fields would be prevptr (8B), CRC
> (4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
> yet know the performance of this: considering that there will be a lot
> more conditionals in header decoding, it might be slower for any one
> backend, but faster overall (fewer IOps overall).
>
> The flags field would be indications for additional information: [flag
> name (bits): explanation (additional xlog header data in bytes)]
> - len_size(0..1): xlog record size is at most xlrec_header_only (0B),
> uint8_max(1B), uint16_max(2B), uint32_max(4B)
> - has_xid (2): contains transaction ID of logging transaction (4B, or
> probably 8B when we introduce 64-bit xids)
> - has_cid (3): contains the command ID of the logging statement (4B)
> (rationale for logging CID in [0], now in record header because XID is
> included there as well, and both are required for consistent
> snapshots)
> - has_rminfo (4): has non-zero redo-manager flags field (1B)
> (rationale for separate field [1], non-zero allows 1B space
> optimization for one of each RMGR's operations)
> - special_rel (5): pre-existing definition
> - check_consistency (6): pre-existing definition
> - unset (7): no meaning defined yet. Could be used for full record
> compression, or other purposes.
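To make the proposed flag layout above concrete, here is a rough sketch of how a decoder might compute the variable header size from the flags byte. Everything in it is an assumption for illustration: the macro and function names are invented, the bit numbers are read as bit 0 being the least significant, and the len_size values 0..3 are taken to map to 0, 1, 2 and 4 length bytes in the order listed. It is not taken from the prototype patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encodings of the proposed flag bits. */
#define SK_LEN_SIZE_MASK  0x03  /* bits 0..1: width of the length field */
#define SK_HAS_XID        0x04  /* bit 2: 4-byte transaction ID present */
#define SK_HAS_CID        0x08  /* bit 3: 4-byte command ID present */
#define SK_HAS_RMINFO     0x10  /* bit 4: 1-byte rmgr flags present */

/* prevptr (8B) + CRC (4B) + rmgr (1B) + flags (1B) */
#define SK_BASE_HEADER    14

/* Header size in bytes implied by a given flags byte. */
static size_t
sk_header_size(uint8_t flags)
{
    /* Assumed mapping of len_size values to length-field widths. */
    static const size_t len_bytes[4] = {0, 1, 2, 4};
    size_t size = SK_BASE_HEADER;

    size += len_bytes[flags & SK_LEN_SIZE_MASK];
    if (flags & SK_HAS_XID)
        size += 4;
    if (flags & SK_HAS_CID)
        size += 4;
    if (flags & SK_HAS_RMINFO)
        size += 1;
    return size;
}
```

Under these assumptions the header ranges from 14 bytes (no optional fields) to 27 bytes (4-byte length, XID, CID and rmgr flags all present), and the chain of conditionals is the extra decoding cost mentioned in the quoted text.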

Interesting. One fly in the ointment here is that WAL records start on
8-byte boundaries (probably MAXALIGN boundaries, but I didn't check
the details). And after the 24-byte header, there's a 2-byte header
(or 5-byte header) introducing the payload data (see
XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload
data is a multiple of 8, and is short enough that we use the short
data header, we waste 6 bytes. If the data length is a multiple of 4,
we waste 2 bytes. And those are probably really common cases. So the
big improvements probably come from saving 2 bytes or 6 bytes or 10
bytes, and saving say 3 or 5 is probably not much better than 2. Or at
least that's what I'm guessing.
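Here is a quick way to check the guess above, as a sketch only. It assumes 8-byte alignment and the 2-byte short data header; the names are hypothetical, and header_len is a parameter so that smaller headers can be tried.

```c
#include <stddef.h>

enum
{
    SK_SHORT_DATA_HEADER = 2,   /* short payload header, per the text above */
    SK_ALIGN = 8                /* assumed record alignment */
};

/*
 * Padding bytes lost to alignment for a record consisting of a fixed
 * header of header_len bytes, the short data header, and the payload.
 */
static size_t
sk_padding_waste(size_t header_len, size_t payload_len)
{
    size_t total = header_len + SK_SHORT_DATA_HEADER + payload_len;
    size_t aligned = (total + SK_ALIGN - 1) / SK_ALIGN * SK_ALIGN;

    return aligned - total;
}
```

With the current 24-byte header, an 8-byte payload wastes 6 bytes (34 rounds up to 40) and a 12-byte payload wastes 2 (38 rounds up to 40). Shrinking the header by 2 bytes makes the 8-byte case fit in exactly 32, while shrinking it by 3 leaves 31, which still occupies 32 on disk, so the third byte buys nothing in that case.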

--
Robert Haas
EDB: http://www.enterprisedb.com
