And yet, if I try to implement a similar mechanism and it is successful, will my revision be considered?

regards
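P.S. To make the proposal concrete: below is roughly the bookkeeping I have in mind for the VACUUM hash table discussed in the quoted mail below. This is only a sketch; ItemAddr, VersionChainEntry, the fixed-size chained hash table, and the function names are simplified stand-ins invented for illustration, not PostgreSQL's actual TID or dynahash infrastructure.

    /*
     * Sketch only: stand-in types, not PostgreSQL internals.
     * Idea: while scanning, file the older version (tuple1) under the
     * address of the newer version (tuple2) that its t_ctid names;
     * when the scan later reaches tuple2, look its address up.
     */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct ItemAddr {   /* stand-in for a tuple address */
        uint32_t block;
        uint16_t offset;
    } ItemAddr;

    typedef struct VersionChainEntry {
        ItemAddr  new_addr;     /* key: address of tuple2, the newer version */
        ItemAddr  old_addr;     /* address of tuple1, the older version      */
        uint32_t  old_xid;      /* xid recorded for tuple1                   */
        uint32_t  old_len;      /* length of tuple1                          */
        struct VersionChainEntry *next;  /* collision chain                  */
    } VersionChainEntry;

    #define NBUCKETS 1024
    static VersionChainEntry *buckets[NBUCKETS];

    static unsigned hash_addr(ItemAddr a)
    {
        return (a.block * 2654435761u + a.offset) % NBUCKETS;
    }

    static int addr_eq(ItemAddr a, ItemAddr b)
    {
        return a.block == b.block && a.offset == b.offset;
    }

    /* Scan sees tuple1 whose t_ctid points at tuple2: remember tuple1
     * under tuple2's address. */
    static void remember_old_version(ItemAddr tuple2, ItemAddr tuple1,
                                     uint32_t xid, uint32_t len)
    {
        unsigned h = hash_addr(tuple2);
        VersionChainEntry *e = malloc(sizeof(VersionChainEntry));
        e->new_addr = tuple2;
        e->old_addr = tuple1;
        e->old_xid  = xid;
        e->old_len  = len;
        e->next = buckets[h];
        buckets[h] = e;
    }

    /* Scan reaches tuple2: was some older version pointing here?  If so,
     * the caller evaluates the pair and makes the cleanup decision. */
    static VersionChainEntry *lookup_old_version(ItemAddr tuple2)
    {
        VersionChainEntry *e;
        for (e = buckets[hash_addr(tuple2)]; e != NULL; e = e->next)
            if (addr_eq(e->new_addr, tuple2))
                return e;
        return NULL;
    }

    int main(void)
    {
        ItemAddr t1 = { 7, 3 }, t2 = { 42, 1 };
        remember_old_version(t2, t1, 1001, 64);
        VersionChainEntry *e = lookup_old_version(t2);
        return (e != NULL && e->old_xid == 1001) ? 0 : 1;
    }

The structure itself is simple and cheap; as the quoted reply argues, the real cost is that the whole table has to be scanned to populate and consume it, much like the existing pointer tracking for index cleanup.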
03.11.2019, 22:15, "Tomas Vondra" <tomas(dot)vondra(at)2ndquadrant(dot)com>:
> On Sun, Nov 03, 2019 at 02:17:15PM +0300, Павел Ерёмин wrote:
>> I completely agree with all of the above. Therefore, the proposed
>> mechanism may entail larger improvements (and not only for VACUUM).
>
> I think the best thing you can do is try implementing this ...
>
> I'm afraid the "improvements" essentially mean making various important
> parts of the system much more complicated and expensive. There's a
> trade-off between saving 8B per row and the additional overhead (during
> vacuum etc.), and it does not seem like a winning strategy. What started
> as "we can simply look at the next row version" is clearly way more
> complicated and expensive.
>
> The trouble here is that it adds dependencies between pages in the data
> file. That means, for example, that cleaning up one page may require
> modifying another page that would otherwise have stayed read-only in
> that checkpoint interval. That's essentially write amplification, and it
> may significantly increase the amount of WAL by generating full-page
> writes (FPW) for the other page.
>
>> I can offer the following solution.
>> For VACUUM, create a hash table.
>> When VACUUM, while scanning the table, sees that a version (tuple1) has
>> t_ctid set and pointing to the address of tuple2, it creates a
>> structure recording tuple1's address, tuple1's xid, tuple1's length
>> (and whatever other information is needed), and puts that structure
>> into the hash table, keyed by tuple2's address.
>> When VACUUM reaches tuple2, it looks up tuple2's address in the hash
>> table; if an entry is found, it evaluates the link between the two
>> versions and makes a cleanup decision.
>
> We know VACUUM is already pretty expensive, so making it even more
> expensive seems pretty awful. And the proposed solution seems damn
> expensive. We already do something similar for indexes - we track
> pointers to removed rows so that we can remove them from the indexes.
> And it's damn expensive precisely because we don't know where in the
> index the tuples are, so we have to scan the whole index.
>
> This would mean doing the same thing for the table, because we don't
> know where in the table the older versions of those rows are. That
> seems mighty expensive.
>
> Not to mention that this does nothing for page-level vacuum, which we
> do when trying to fit another row onto a page (e.g. for HOT). That has
> to be absolutely cheap; we are certainly not going to do lookups of
> other pages or go looking for older versions of the row there.
>
> Being able to make visibility decisions based on the tuple alone (or
> possibly page-level plus tuple information) has a lot of value, and I
> don't think we want to make that more complicated.
>
> regards
>
> --
> Tomas Vondra                  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
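For what it's worth, I do see the property being defended here: the visibility decision reads nothing beyond the tuple's own header and the snapshot. Below is a deliberately simplified sketch of that idea; the types and the committed() callback are invented for illustration, and the real check (HeapTupleSatisfiesMVCC) additionally handles hint bits, in-progress transaction lists, and pg_xact lookups.

    /*
     * Simplified illustration only: made-up types, not PostgreSQL's
     * HeapTupleSatisfiesMVCC.  The point is that the decision reads only
     * the tuple header plus the snapshot; no other page or row version
     * has to be visited.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    typedef struct TupleHeader {
        Xid xmin;               /* xid that created this row version      */
        Xid xmax;               /* xid that deleted/updated it, 0 if none */
    } TupleHeader;

    typedef struct Snapshot {
        Xid  xmax_horizon;      /* xids >= this were still running        */
        bool (*committed)(Xid); /* callback: did this xid commit?         */
    } Snapshot;

    static bool tuple_visible(const TupleHeader *tup, const Snapshot *snap)
    {
        /* Creator must be committed and within the snapshot's horizon. */
        if (tup->xmin >= snap->xmax_horizon || !snap->committed(tup->xmin))
            return false;
        /* A committed deleter within the horizon means the row is gone. */
        if (tup->xmax != 0 &&
            tup->xmax < snap->xmax_horizon && snap->committed(tup->xmax))
            return false;
        return true;
    }

    static bool always_committed(Xid xid) { (void) xid; return true; }

    int main(void)
    {
        TupleHeader live = { 100, 0 }, dead = { 100, 200 };
        Snapshot snap = { 1000, always_committed };
        return (tuple_visible(&live, &snap) &&
                !tuple_visible(&dead, &snap)) ? 0 : 1;
    }

Any design that replaces xmax with a pointer to the next version has to preserve something equivalent to this tuple-local property, which is exactly the complication being objected to above.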