Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2018-09-28 14:50:33
Message-ID: 6d83ec20-2a55-c051-0159-e92d39bdfe20@2ndquadrant.com
Lists: pgsql-hackers

On 19/09/2018 20:23, Peter Geoghegan wrote:
> Attached is v5,

So. I don't know much about the btree code, so don't believe anything I
say.

I was very interested in the bloat test case that you posted on
2018-07-09, and I tried to understand it more. The current method for
inserting a duplicate value into a btree is to go to the leftmost point
for that value and then move right until we find some space or we get
"tired" of searching, in which case we just make some space right there.
The problem is that it's tricky to decide when to stop searching, and
there are scenarios where we stop too soon and repeatedly miss all the
good free space to the right, leading to bloat even though the index is
perhaps quite empty.
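
For reference, the logic in question is in _bt_findinsertloc()
(src/backend/access/nbtree/nbtinsert.c). A simplified sketch of the
move-right loop, with locking and dead-item cleanup omitted:

while (PageGetFreeSpace(page) < itemsz)
{
    /*
     * Stop if this is the rightmost page, if the high key no longer
     * equals our scankey (the run of duplicates has ended), or --
     * with a hard-coded ~1% chance per step -- if we get "tired"
     * and decide to split the current page instead.
     */
    if (P_RIGHTMOST(lpageop) ||
        _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
        random() <= (MAX_RANDOM_VALUE / 100))
        break;

    /* otherwise step right to the next page and try again
     * (stepping-right code omitted) */
}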

I tried playing with the getting-tired factor (it could plausibly be a
reloption), but that wasn't very successful. You can use it to postpone
the bloat, but not to stop it, and performance becomes terrible.
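
To be concrete, the factor I mean is the hard-coded 1-in-100 chance in
the loop above, so a reloption would just replace the constant. A
hypothetical sketch (the reloption and its accessor don't exist; the
names are invented):

/* invented "getting_tired_factor" reloption, default 100 (= 1%) */
int     tired_factor = BTGetTiredFactor(rel);

if (random() <= (MAX_RANDOM_VALUE / tired_factor))
    break;      /* give up moving right; split the current page */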

You propose to address this by appending the tid to the index key:
each key, even if its "payload" is a duplicate value, is then unique
and has a unique place, so we never have to do this "tiresome" search.
This makes a lot of sense, and the results in the bloat test you posted
are impressive and reproducible.
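
In other words, the heap tid becomes a final tiebreaker column. A
minimal sketch of the comparison rule (not the patch's actual code):

/*
 * Order index tuples by their key columns first; compare heap TIDs
 * only when all key columns are equal.
 */
static int
compare_with_tid_tiebreak(int keycmp, ItemPointer a, ItemPointer b)
{
    if (keycmp != 0)
        return keycmp;                  /* key columns decide */
    return ItemPointerCompare(a, b);    /* heap TID breaks the tie */
}

With that rule, every tuple has exactly one legal position, so the
insertion descent can go straight to it.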

I tried a silly alternative approach: placing a new duplicate key in a
random location. This should be equivalent, since tids are effectively
random. I haven't quite gotten this to fully work yet, but at least it
doesn't blow up, and it produces the same regression test ordering
differences for pg_depend scans that you are trying to paper over. ;-)
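
One possible shape for this, with invented helper names and no locking
shown:

/*
 * Reservoir sampling over the run of pages holding the duplicate
 * key: each page with enough free space is chosen with equal
 * probability.  first_dup_page(), next_dup_page(), and
 * page_has_room() are invented helpers, not existing functions.
 */
BlockNumber chosen = InvalidBlockNumber;
BlockNumber blk;
int         ncandidates = 0;

for (blk = first_dup_page(rel, scankey);
     blk != InvalidBlockNumber;
     blk = next_dup_page(rel, blk, scankey))
{
    if (page_has_room(rel, blk, itemsz) &&
        random() % ++ncandidates == 0)
        chosen = blk;
}

Of course this walks the whole run of duplicates, so it trades the
early "tired" cutoff for a full scan; a real version would have to
bound that somehow.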

As far as the code is concerned, I agree with Andrey Lepikhov that one
more abstraction layer that combines the scankey and the tid would be
useful, instead of passing the tid as a separate argument everywhere.
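
Something like the following, say (the names are invented, just to
show the shape):

/*
 * Hypothetical insertion-key struct bundling the scankey with the
 * optional heap TID tiebreaker, so callers pass one thing around.
 */
typedef struct BTInsertKeyData
{
    ScanKey         scankeys;   /* key-column comparison data */
    int             keysz;      /* number of entries in scankeys */
    ItemPointerData scantid;    /* heap TID tiebreaker; may be unset */
} BTInsertKeyData;

typedef BTInsertKeyData *BTInsertKey;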

I think it might help this patch move along if it were split up a bit,
for example: 1) suffix truncation, 2) tid stuff, 3) new split
strategies. That way it would also be easier to test each piece
separately: how much space suffix truncation saves in which scenarios,
whether there are any performance regressions, and so on. In the last
few versions, the patches have still been growing significantly in size
and functionality, but most of the supposed benefits are not readily
visible in tests.

And of course we need to think about how to handle upgrades, but you
have already started a separate discussion about that.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
