Re: 64 bit transaction id

From: Павел Ерёмин <shnoor111gmail(at)yandex(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 64 bit transaction id
Date: 2019-11-03 11:17:15
Message-ID: 56862891572779835@sas8-004ac07304b9.qloud-c.yandex.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

<div><div>I completely agree with all of the above. Therefore, the proposed mechanism may entail larger improvements (and not only VACUUM).</div><div>I can offer the following solution.</div><div>For VACUUM, create a hash table.</div><div>VACUUM scanning the table sees that the version (tuple1) has t_ctid filled and refers to the address tuple2, it creates a structure into which it writes the address tuple1, tuple1.xid, length tuple1 (well, and other information that is needed), puts this structure in the hash table by key tuple2 addresses.</div><div>VACUUM reaches tuple2, checks the address of tuple2 in the hash table - if it finds it, it evaluates the connection between them and makes a decision on cleaning.</div><div> </div><div> </div><div><span style="background:white;color:black;font:11.5pt 'arial' , sans-serif">regards</span></div></div><div><br /></div><div><br /></div><div>03.11.2019, 02:20, "Tomas Vondra" &lt;tomas(dot)vondra(at)2ndquadrant(dot)com&gt;:</div><blockquote><p>On Sat, Nov 02, 2019 at 11:35:09PM +0300, Павел Ерёмин wrote:<br /></p><blockquote class="b4fd5cf2ec92bc68cb898700bb81355fwmi-quote">   The proposed option is not much different from what it is now.<br />   We are not trying to save some space - we will reuse the existing one. We<br />   just work in 64 bit transaction counters. Correct me if I'm wrong - the<br />   address of the next version of the line is stored in the 6 byte field<br />   t_cid in the tuple header - which is not attached to the current page in<br />   any way - and can be stored anywhere in the table. Nothing changes.<br /></blockquote><p><br />I think you mean t_ctid, not t_cid (which is a 4-byte CommandId, not any<br />sort of item pointer).<br /><br />I think this comment from htup_details.h explains the issue:<br /><br /> * ... Beware however that VACUUM might<br /> * erase the pointed-to (newer) tuple before erasing the pointing (older)<br /> * tuple. Hence, when following a t_ctid link, it is necessary to check<br /> * to see if the referenced slot is empty or contains an unrelated tuple.<br /> * Check that the referenced tuple has XMIN equal to the referencing tuple's<br /> * XMAX to verify that it is actually the descendant version and not an<br /> * unrelated tuple stored into a slot recently freed by VACUUM. If either<br /> * check fails, one may assume that there is no live descendant version.<br /><br />Now, imagine you have a tuple that gets updated repeatedly (say, 3x) and<br />each version gets to a different page. Say, pages #1, #2, #3. And then<br />VACUUM happens on some of the "middle" page (this may happen when trying<br />to fit new row onto a page to allow HOT, but it might happen even during<br />regular VACUUM).<br /><br />So we started with 3 tuples on pages #1, #2, #3, but now we have this<br /><br />  #1 - tuple exists, points to tuple on page #2<br />  #2 - tuple no longer exists, cleaned up by vacuum<br />  #3 - tuple exists<br /><br />The scheme you proposed requires existence of all the tuples in the<br />chain to determine visibility. When tuple #2 no longer exists, it's<br />impossible to decide whether tuple on page #1 is visible or not.<br /><br />This also significantly increases the amount of random I/O, pretty much<br />by factor of 2, because whenever you look at a row, you also have to<br />look at the "next version" which may be on another page. That's pretty<br />bad, bot for I/O and cache hit ratio. I don't think that's a reasonable<br />trade-off (at least compared to simply making the XIDs 64bit).<br /><br /><br />regards<br /><br /></p><span class="c18e9d485856a85513717a5a5b59d3fewmi-sign">-- <br />Tomas Vondra <a href="http://www.2ndquadrant.com/">http://www.2ndQuadrant.com</a><br />PostgreSQL Development, 24x7 Support, Remote DBA, Training &amp; Services <br /></span></blockquote>

Attachment Content-Type Size
unknown_filename text/html 3.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dent John 2019-11-03 11:51:14 Re: The flinfo->fn_extra question, from me this time.
Previous Message Tels 2019-11-03 09:24:43 Re: pglz performance