Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>, "Andrey V(dot) Lepikhov" <a(dot)lepikhov(at)postgrespro(dot)ru>
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date: 2019-03-05 20:03:17
Message-ID: CAH2-WzkkXhhq7r-_YrME8CKQpjjvf3WAP6QzeN-FOLn2s9m39g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I'm looking at the first patch in the series now. I'd suggest that you
> commit that very soon. It's useful on its own, and seems pretty much
> ready to be committed already. I don't think it will be much affected by
> whatever changes we make to the later patches, anymore.

I agree that the parts covered by the first patch in the series are
very unlikely to need changes, but I hesitate to commit it weeks ahead
of the other patches. Some of the things that make _bt_findinsertloc()
fast are missing for v3 indexes. The "consider secondary factors
during nbtree splits" patch actually more than compensates for that
with v3 indexes, at least in some cases, but the first patch applied
on its own will slightly regress performance. At least, I benchmarked
the first patch on its own several months ago and noticed a small
regression at the time, though I don't have the exact details at hand.
It might have been an invalid result, because I wasn't particularly
thorough at the time.

We do make some gains in the first patch (the _bt_check_unique()
thing), but we also check the high key more than we need to within
_bt_findinsertloc() for non-unique indexes. Plus, the microvacuuming
thing isn't as streamlined.

It's a lot of work to validate and revalidate the performance of a
patch like this, and I'd rather commit the first three patches within
a couple of days of each other (I can validate v3 indexes and v4
indexes separately, though). We can put off the other patches for
longer, and treat them as independent. I guess I'd also push the final
amcheck patch following the first three -- no point in holding back on
that. Then we'd be left with "Add "split after new tuple"
optimization", and "Add high key "continuescan" optimization" as
independent improvements that can be pushed at the last minute of the
final CF.

> I also had a few comments and questions on some details. I added them in
> the same patch, marked with "HEIKKI:". Please take a look.

Will respond now. Any point that I haven't responded to directly has
been accepted.

> +HEIKKI: 'checkingunique' is a local variable in the function. Seems a bit
> +weird to talk about it in the function comment. I didn't understand what
> +the point of adding this sentence was, so I removed it.

Maybe there is no point in the comment you reference here, but I like
the idea of "checkingunique", because that symbol name is a common
thread between a number of functions that coordinate with each other.
It's not just a local variable in one function.

> @@ -588,6 +592,17 @@ _bt_check_unique(Relation rel, BTScanInsert itup_key,
> if (P_RIGHTMOST(opaque))
> break;
> highkeycmp = _bt_compare(rel, itup_key, page, P_HIKEY);
> +
> + /*
> + * HEIKKI: This assertion might fire if the user-defined opclass
> + * is broken. It's just an assertion, so maybe that's ok. With a
> + * broken opclass, it's obviously "garbage in, garbage out", but
> + * we should try to behave sanely anyway. I don't remember what
> + * our general policy on that is; should we assert, elog(ERROR),
> + * or continue silently in that case? An elog(ERROR) or
> + * elog(WARNING) would feel best to me, but I don't remember what
> + * we usually do.
> + */
> Assert(highkeycmp <= 0);
> if (highkeycmp != 0)
> break;

We don't really have a general policy on it. However, I don't have any
sympathy for the idea of trying to soldier on with a corrupt index. I
also don't think that it's worth making this a "can't happen" error.
Like many of my assertions, this assertion is intended to document an
invariant. I don't actually anticipate that it could ever really fail.

> +Should we mention explicitly that this binary-search reuse is only applicable
> +if unique checks were performed? It's kind of implied by the fact that it's
> +_bt_check_unique() that saves the state, but perhaps we should be more clear
> +about it.

I guess so.

> +What is a "garbage duplicate"? Same as a "dead duplicate"?

Yes.

> +The last sentence, about garbage duplicates, seems really vague. Why do we
> +ever do any comparisons that are not strictly necessary? Perhaps it's best to
> +just remove that last sentence.

Okay -- will remove.

> +
> +HEIKKI: I don't buy the argument that microvacuuming has to happen here. You
> +could easily imagine a separate function that does microvacuuming, and resets
> +(or even updates) the binary-search cache in the insertion key. I agree this
> +is a convenient place to do it, though.

It wasn't supposed to be a water-tight argument. I'll just say that
it's convenient.

> +/* HEIKKI:
> +Do we need 'checkunique' as an argument? If unique checks were not
> +performed, the insertion key will simply not have saved state.
> +*/

We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).

Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.

I actually like the idea of making certain insertion scan key mutable
state relating to search bounds hidden in the case of "dynamic prefix
truncation" [1]. Doesn't seem to make sense here, though.

> + /* HEIKKI: I liked this comment that we used to have here, before this patch: */
> + /*----------
> + * If we will need to split the page to put the item on this page,
> + * check whether we can put the tuple somewhere to the right,
> + * instead. Keep scanning right until we

> + /* HEIKKI: Maybe it's not relevant with the later patches, but at least
> + * with just this first patch, it's still valid. I noticed that the
> + * comment is now in _bt_useduplicatepage, it seems a bit out-of-place
> + * there. */

I don't think it matters, because I don't think that the first patch
can be justified as an independent piece of work. I like the idea of
breaking up the patch series, because it makes it all easier to
understand, but the first three patches are kind of intertwined.

> +HEIKKI: In some scenarios, if the BTP_HAS_GARBAGE flag is falsely set, we would
> +try to microvacuum the page twice: first in _bt_useduplicatepage, and second
> +time here. That's because _bt_vacuum_one_page() doesn't clear the flag, if
> +there are in fact no LP_DEAD items. That's probably insignificant and not worth
> +worrying about, but I thought I'd mention it.

Right. It's also true that all future insertions will reach
_bt_vacuum_one_page() and do the same again, until there either is
garbage, or until the page splits.

> - * rightmost page case), all the items on the right half will be user data
> - * (there is no existing high key that needs to be relocated to the new
> - * right page).
> + * rightmost page case), all the items on the right half will be user
> + * data.
> + *
> +HEIKKI: I don't think the comment change you made here was needed or
> +helpful, so I reverted it.

I thought it added something when you're looking at it from a
WAL-logging point of view. But I can live without this.

> - * starting a regular index scan some can be omitted. The array is used as a
> + * starting a regular index scan, some can be omitted. The array is used as a
> * flexible array member, though it's sized in a way that makes it possible to
> * use stack allocations. See nbtree/README for full details.
> +
> +HEIKKI: I don't see anything in the README about stack allocations. What
> +exactly does the README reference refer to? No code seems to actually allocate
> +this in the stack, so we don't really need that.

The README discusses insertion scankeys in general, though. I think you
read the comment that way because you're focused on my changes, not
because it actually implies that the README covers the stack allocation
specifically. (But I can change it if you like.)

There is a stack allocation in _bt_first(). This was once just another
dynamic allocation, that called _bt_mkscankey(), but that regressed
nested loop joins, so I had to make it work the same way as before. I
noticed this about six months ago, because there was a clear impact on
the TPC-C "Stock level" transaction, which is now sometimes twice as
fast with the patch series. Note also that commit d961a568, from 2005,
changed the _bt_first() code to use a stack allocation. Besides,
sticking to a stack allocation makes the changes to _bt_first()
simpler, even though it has to duplicate a few things from
_bt_mkscankey().

I could get you a v15 that integrates your changes pretty quickly, but
I'll hold off on that for at least a few days. I have a feeling that
you'll have more feedback for me to work through before too long.

[1] https://postgr.es/m/CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
--
Peter Geoghegan
