Re: Replication identifiers, take 4

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Steve Singer <steve(at)ssinger(dot)info>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Replication identifiers, take 4
Date: 2015-04-17 08:54:51
Message-ID: 20150417085451.GJ2361@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-04-12 22:02:38 +0300, Heikki Linnakangas wrote:
> This needs to be weighed against removing the padding bytes
> altogether.

Hrmpf. Says the person who used a lot of padding, without much
discussion, for the WAL-level infrastructure that makes pg_rewind more
maintainable. And you deemed it perfectly ok to use them up to avoid
*increasing* the WAL size with the *additional data* (which so far
nothing but pg_rewind needs in that way), while they could perfectly well
have been used to shrink the WAL size below what it is now. And those
changes are *far*, *far* harder to back out/refactor than this one (which
is pretty localized and thus can easily be changed); to the point that I
think it's infeasible to do so...

If you want to shrink the WAL size, send in a patch independently. Not
as a way to block somebody else implementing something.

> I'm surprised there's such a big difference between the "extern" and
> "padding" versions above. At a quick approximation, storing the ID as a
> separate "fragment", along with XLogRecordDataHeaderShort and
> XLogRecordDataHeaderLong, should add one byte of overhead plus the ID
> itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 bytes
> for 4-byte identifiers. Does that mean that the average record length is
> only about 30 bytes?

Yes, nearly. That's pg_xlogdump --stats=record output from the above
scenario, with replication identifiers used and the padding reused:

Type                        N      (%)  Record size      (%)  FPI size      (%)  Combined size      (%)
----                        -      ---  -----------      ---  --------      ---  -------------      ---
Transaction/COMMIT      50003 ( 16.89)      2600496 ( 23.38)         0 (  -nan)        2600496 ( 23.38)
CLOG/ZEROPAGE               1 (  0.00)           28 (  0.00)         0 (  -nan)             28 (  0.00)
Standby/RUNNING_XACTS       5 (  0.00)          248 (  0.00)         0 (  -nan)            248 (  0.00)
Heap2/CLEAN             46034 ( 15.55)      1473088 ( 13.24)         0 (  -nan)        1473088 ( 13.24)
Heap2/VISIBLE               2 (  0.00)           56 (  0.00)         0 (  -nan)             56 (  0.00)
Heap/INSERT             49682 ( 16.78)      1341414 ( 12.06)         0 (  -nan)        1341414 ( 12.06)
Heap/HOT_UPDATE        150013 ( 50.67)      5700494 ( 51.24)         0 (  -nan)        5700494 ( 51.24)
Heap/INPLACE                5 (  0.00)          130 (  0.00)         0 (  -nan)            130 (  0.00)
Heap/INSERT+INIT          318 (  0.11)         8586 (  0.08)         0 (  -nan)           8586 (  0.08)
Btree/VACUUM                2 (  0.00)           56 (  0.00)         0 (  -nan)             56 (  0.00)
                     --------              --------           --------                --------
Total                  296065              11124596 [100.00%]        0 [0.00%]        11124596 [100%]

(The FPI percentage display above is arguably borked. Interesting.)

So the average record size is ~37.5 bytes including the increased commit
record size due to the origin information (which is the part that
increases the size for that version that reuses the padding).
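
As a quick sanity check of that average, here is a minimal sketch that
recomputes it from the "Total" row of the table above. The 3- and 5-byte
per-record overheads are the rough approximation quoted earlier for
2-byte and 4-byte identifiers stored as a separate fragment; alignment
padding is ignored, so the percentages are only indicative.

```python
# Totals copied from the "Total" row of the stats output above.
total_bytes = 11124596
total_records = 296065

avg_record_size = total_bytes / total_records
print(f"average record size: {avg_record_size:.2f} bytes")

# Approximate per-record overhead (bytes) of storing the identifier as a
# separate fragment, keyed by identifier width. These are the quoted
# estimates, not measured values.
overhead_by_id_width = {2: 3, 4: 5}

relative_overhead = {
    width: extra / avg_record_size
    for width, extra in overhead_by_id_width.items()
}
for width, share in relative_overhead.items():
    print(f"{width}-byte identifier: about {share:.1%} of the average record")
```

With records this small, a handful of extra bytes per record is a
visible fraction of the total WAL volume.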

This *most definitely* isn't representative of every workload. But it
*is* *a* common type of workload.

Note that --stats will *not* show the size difference in xlog records
when adding data as an additional chunk vs. padding as it uses
XLogRecGetDataLen() to compute the record length... That confused me for
a while.

> That doesn't sound right, 30 bytes is very little.

Well, it's mostly HOT_UPDATEs and INSERTs into non-indexed tables, so
that's not too surprising. Obviously that'd look different with FPIs
enabled.

> Perhaps the size
> of the records created by pgbench happen to cross a 8-byte alignment
> boundary at that point, making a big difference. In another workload,
> there might be no difference at all, due to alignment.

Right.
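
To illustrate the alignment effect, here is a small sketch assuming
records are padded up to an 8-byte boundary (the usual MAXALIGN on 64-bit
platforms). The helper names and the example lengths are made up for
illustration; they are not measurements from the benchmark.

```python
MAXIMUM_ALIGNOF = 8  # assumption: typical 64-bit platform


def maxalign(length: int) -> int:
    """Round length up to the next multiple of MAXIMUM_ALIGNOF."""
    return (length + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def padded_growth(length: int, extra: int) -> int:
    """Bytes the padded record grows by when `extra` bytes are added."""
    return maxalign(length + extra) - maxalign(length)


# Hypothetical record lengths; 3 bytes is the approximate overhead of a
# 2-byte identifier stored as a separate fragment.
for length in (28, 29, 30, 37, 38):
    print(length, "->", padded_growth(length, 3), "extra bytes after padding")
```

Depending on where the unpadded length falls, the same 3 bytes cost
either nothing or a full 8 bytes, which is why the measured difference
depends so heavily on the workload.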

> Also, you don't need to tag every record type with the replication ID. All
> indexam records can skip it, for starters, since logical decoding doesn't
> care about them. That should remove a fair amount of bloat.

Yes, I mentioned that. It's additional complexity, because the decision
now has to be made at each xlog insertion callsite, which makes
refactoring this into a different representation a bit harder. I don't
think it will make that much of a difference in the above workload
(just CLEAN will be smaller), but it clearly might in others.
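
A rough upper bound for that saving, as a sketch: it assumes every
Heap2/CLEAN record in the table above would otherwise carry the origin at
roughly 3 bytes each (the approximation quoted earlier for 2-byte
identifiers), and it ignores alignment padding.

```python
# Counts and totals copied from the stats output above.
clean_records = 46034
total_bytes = 11124596

# Assumption: about 3 bytes per record for an "external" 2-byte identifier.
overhead_per_record = 3

saved_bytes = clean_records * overhead_per_record
saved_share = saved_bytes / total_bytes

print(f"bytes saved by skipping the origin on CLEAN: {saved_bytes}")
print(f"share of total WAL volume: {saved_share:.2%}")
```

That is on the order of one percent of the WAL volume for this
workload, consistent with it not making much of a difference here.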

I've attached a rebased patch that adds a decision about origin logging
to the relevant XLogInsert() callsites for "external" 2-byte identifiers
and removes the pad-reusing version, in the interest of moving forward. I
still don't see a point in using 4-byte identifiers atm; given the above
numbers, that just seems like a waste for unrealistic use cases (>2^16
nodes). It's just two lines to change if we feel the need in the future.
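
For reference, a tiny sketch of the identifier space each width gives.
It uses Python's struct module with unsigned little-endian formats purely
as an illustration of the two sizes, not as the on-disk representation
used by the patch.

```python
import struct

# "<H" is an unsigned 2-byte integer, "<I" an unsigned 4-byte integer.
formats = {"2-byte": "<H", "4-byte": "<I"}

id_space = {}
for name, fmt in formats.items():
    width = struct.calcsize(fmt)
    id_space[name] = 2 ** (8 * width)
    print(f"{name} identifier: {width} bytes, {id_space[name]} distinct values")
```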

Working on fixing the issue with WAL logging of deletions and
rearranging docs as Petr suggested. Not sure if the latter will really
look good, but I guess we'll see ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
0001-Introduce-replication-identifiers-v1.1.patch text/x-patch 125.2 KB
