Re: WIP: Covering + unique indexes.

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, David Steele <david(at)pgmasters(dot)net>, Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Aleksander Alekseev <a(dot)alekseev(at)postgrespro(dot)ru>
Subject: Re: WIP: Covering + unique indexes.
Date: 2017-03-31 23:31:24
Message-ID: CAH2-WzkT-MeSnHozqoxmmNthVRK9O06rKjCyZZeUri0X41imvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I had a quick look at this on the flight back from PGConf.US.

On Fri, Mar 31, 2017 at 10:40 AM, Anastasia Lubennikova
<a(dot)lubennikova(at)postgrespro(dot)ru> wrote:
> But I haven't considered the possibility of index_form_tuple() failure.
> Fixed in this version of the patch. Now it creates a copy of tupledesc to
> pass it to index_form_tuple.

I think that we need to be 100% sure that index_truncate_tuple() will
not generate an IndexTuple that is larger than the original.
Otherwise, you could violate the "1/3 of page size exceeded" thing. We
need to catch that when the user actually inserts an oversized value.
After that, it's always too late. (See my remarks to Tom on other
thread about this, too.)

> We'd discussed with other reviewers, they suggested index_truncate_tuple()
> instead of index_reform_tuple().
> I think that this name reflects the essence of the function clear enough and
> don't feel like renaming it again.

+1.

Feedback so far:

* index_truncate_tuple() should have as an argument the number of
attributes. No need to "#include utils/rel.h" that way.

* I think that we should store this (the number of attributes), and
use it directly when comparing, per my remarks to Tom over on that
other thread. We should also use the free bit within
IndexTupleData.t_info, to indicate that the IndexTuple was truncated,
just to make it clear to everyone that might care that that's how
these truncated IndexTuples need to be represented.

Doing this would have no real impact on your patch, because for you
this will be 100% redundant. It will help external tools, and perhaps
another, more general suffix truncation patch that comes in the
future. We should try very hard to have a future-proof on-disk
representation. I think that this is quite possible.

* I suggest adding a "can't happen" defensive check + error that
checks that the tuple returned by index_truncate_tuple() is sized <=
the original. This cannot be allowed to ever happen. (Not that I think
it will.)

* I see a small bug. You forgot to teach _bt_findsplitloc() about
truncation. It does this currently, which you did not update:

/*
* The first item on the right page becomes the high key of the left page;
* therefore it counts against left space as well as right space.
*/
leftfree -= firstrightitemsz;

I think that this accounting needs to be fixed.

* Note sure about one thing. What's the reason for this change?

> - /* Log left page */
> - if (!isleaf)
> - {
> - /*
> - * We must also log the left page's high key, because the right
> - * page's leftmost key is suppressed on non-leaf levels. Show it
> - * as belonging to the left page buffer, so that it is not stored
> - * if XLogInsert decides it needs a full-page image of the left
> - * page.
> - */
> - itemid = PageGetItemId(origpage, P_HIKEY);
> - item = (IndexTuple) PageGetItem(origpage, itemid);
> - XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> - }
> + /*
> + * We must also log the left page's high key, because the right
> + * page's leftmost key is suppressed on non-leaf levels. Show it
> + * as belonging to the left page buffer, so that it is not stored
> + * if XLogInsert decides it needs a full-page image of the left
> + * page.
> + */
> + itemid = PageGetItemId(origpage, P_HIKEY);
> + item = (IndexTuple) PageGetItem(origpage, itemid);
> + XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

Is this related to the problem that you mentioned to me that you'd
fixed when we spoke in person earlier today? You said something about
WAL logging, but I don't recall any details. I don't remember seeing
this change in prior versions.

Anyway, whatever the reason for doing this on the leaf level now, the
comments should be updated to explain it.

* Speaking of WAL-logging, I think that there is another bug in
btree_xlog_split(). You didn't change this existing code at all:

/*
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

left_hikey = PageGetItem(rpage, hiItemId);
left_hikeysz = ItemIdGetLength(hiItemId);
}

It seems like this was missed when you changed WAL-logging, since you
do something for this on the logging side, but not here, on the replay
side. No?

That's all I have for now. Maybe I can look again later, or tomorrow.

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2017-03-31 23:41:15 Re: [bug fix] Savepoint-related statements terminates connection
Previous Message Tom Lane 2017-03-31 23:20:46 Re: Somebody has not thought through subscription locking considerations