Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-07 06:15:47
Message-ID: d89d40ff-5eb7-574f-b33d-c432f323cb99@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 06/03/2019 04:03, Peter Geoghegan wrote:
> On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I'm looking at the first patch in the series now. I'd suggest that you
>> commit that very soon. It's useful on its own, and seems pretty much
>> ready to be committed already. I don't think it will be much affected by
>> whatever changes we make to the later patches, anymore.

After staring at the first patch for bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we call insertion scan
keys with a less confusing name. I don't know what to suggest, though.
"Positioning key"?

* We store the binary search bounds in BTScanInsertData, but they're
only used during insertions.

* The binary search bounds are specific for a particular buffer. But
that buffer is passed around separately from the bounds. It seems easy
to have them go out of sync, so that you try to use the cached bounds
for a different page. The savebinsrch and restorebinsrch is used to deal
with that, but it is pretty complicated.

I came up with the attached (against master), which addresses the 2nd
and 3rd points. I added a whole new BTInsertStateData struct, to hold
the binary search bounds. BTScanInsert now only holds the 'scankeys'
array, and the 'nextkey' flag. The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.

I made a copy of the _bt_binsrch, _bt_binsrch_insert. It does the binary
search like _bt_binsrch does, but the bounds caching is only done in
_bt_binsrch_insert. Seems more clear to have separate functions for them
now, even though there's some duplication.

>> +/* HEIKKI:
>> +Do we need 'checkunique' as an argument? If unique checks were not
>> +performed, the insertion key will simply not have saved state.
>> +*/
>
> We need it in the next patch in the series, because it's also useful
> for optimizing away the high key check with non-unique indexes. We
> know that _bt_moveright() was called at the leaf level, with scantid
> filled in, so there is no question of needing to move right within
> _bt_findinsertloc() (provided it's a heapkeyspace index).

Hmm. Perhaps it would be to move the call to _bt_binsrch (or
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
that _bt_findinsertloc would only be responsible for finding the correct
page to insert to. So the overall code, after patch #2, would be like:

/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for leaf page. Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?

> Actually, we even need it in the first patch: we only restore a binary
> search because we know that there is something to restore, and must
> ask for it to be restored explicitly (anything else seems unsafe).
> Maybe we can't restore it because it's not a unique index, or maybe we
> can't restore it because we microvacuumed, or moved right to get free
> space. I don't think that it'll be helpful to make _bt_findinsertloc()
> pretend that it doesn't know exactly where the binary search bounds
> come from -- it already knows plenty about unique indexes
> specifically, and about how it may have to invalidate the bounds. The
> whole way that it couples buffer locks is only useful for unique
> indexes, so it already knows *plenty* about unique indexes
> specifically.

The attached new version simplifies this, IMHO. The bounds and the
current buffer go together in the same struct, so it's easier to keep
track whether the bounds are valid or not.

>> - * starting a regular index scan some can be omitted. The array is used as a
>> + * starting a regular index scan, some can be omitted. The array is used as a
>> * flexible array member, though it's sized in a way that makes it possible to
>> * use stack allocations. See nbtree/README for full details.
>> +
>> +HEIKKI: I don't see anything in the README about stack allocations. What
>> +exactly does the README reference refer to? No code seems to actually allocate
>> +this in the stack, so we don't really need that.
>
> The README discusses insertion scankeys in general, though. I think
> that you read it that way because you're focussed on my changes, and
> not because it actually implies that the README talks about the stack
> thing specifically. (But I can change it if you like.)
>
> There is a stack allocation in _bt_first(). This was once just another
> dynamic allocation, that called _bt_mkscankey(), but that regressed
> nested loop joins, so I had to make it work the same way as before.

Ah, gotcha, I missed that.

- Heikki

Attachment Content-Type Size
v14-heikki-2-0001-Refactor-nbtree-insertion-scankeys.patch text/x-patch 60.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-03-07 06:20:13 Re: Prevent extension creation in temporary schemas
Previous Message Masahiko Sawada 2019-03-07 06:02:59 Re: New vacuum option to do only freezing