Re: Minmax indexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minmax indexes
Date: 2014-07-29 11:33:20
Message-ID: 53D78680.2020001@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 07/10/2014 12:41 AM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
>
>> I feel that the below would nevertheless be simpler:
>>
>>>> I wonder if it would be simpler to just always store the revmap
>>>> pages in the beginning of the index, before any other pages. Finding
>>>> the revmap page would then be just as easy as with a separate fork.
>>>> When the table/index is extended so that a new revmap page is
>>>> needed, move the existing page at that block out of the way. Locking
>>>> needs some consideration, but I think it would be feasible and
>>>> simpler than you have now.
>>>
>>> Moving index items around is not easy, because you'd have to adjust the
>>> revmap to rewrite the item pointers.
>>
>> Hmm. Two alternative schemes come to mind:
>>
>> 1. Move each index tuple off the page individually, updating the
>> revmap while you do it, until the page is empty. Updating the revmap
>> for a single index tuple isn't difficult; you have to do it anyway
>> when an index tuple is replaced. (MMTuples don't contain a heap
>> block number ATM, but IMHO they should, see below)
>>
>> 2. Store the new block number of the page that you moved out of the
>> way in the revmap page, and leave the revmap pointers unchanged. The
>> revmap pointers can be updated later, lazily.
>>
>> Both of those seem pretty straightforward.
>
> The trouble I have with moving blocks around to make space, is that it
> would cause the index to have periodic hiccups to make room for the new
> revmap pages. One nice property that these indexes are supposed to have
> is that the effect into insertion times should be pretty minimal. That
> would cease to be the case if we have to do your proposed block moves.

Approach 2 above is fairly quick, quick enough that no-one would notice
the "hiccup". Moving the tuples individually (approach 1) would be slower.

>>>> ISTM that when the old tuple cannot be updated in-place, the new
>>>> index tuple is inserted with mm_doinsert(), but the old tuple is
>>>> never deleted.
>>>
>>> It's deleted by the next vacuum.
>>
>> Ah I see. Vacuum reads the whole index, and builds an in-memory hash
>> table that contains an ItemPointerData for every tuple in the index.
>> Doesn't that require a lot of memory, for a large index? That might
>> be acceptable - you ought to have plenty of RAM if you're pushing
>> around multi-terabyte tables - but it would nevertheless be nice to
>> not have a hard requirement for something as essential as vacuum.
>
> I guess if you're expecting that pages_per_range=1 is a common case,
> yeah it might become an issue eventually.

Not sure, but I find it easier to think of the patch that way. In any
case, it would be nice to avoid the problem, even if it's not common.

> One idea I just had is to
> have a bit for each index tuple, which is set whenever the revmap no
> longer points to it. That way, vacuuming is much easier: just scan the
> index and delete all tuples having that bit set.

The bit needs to be set atomically with the insertion of the new tuple,
so why not just remove the old tuple right away?

>> Wouldn't it be simpler to remove the old tuple atomically with
>> inserting the new tuple and updating the revmap? Or at least mark
>> the old tuple as deletable, so that vacuum can just delete it,
>> without building the large hash table to determine that it's
>> deletable.
>
> Yes, it might be simpler, but it'd require dirtying more pages on
> insertions (and holding more page-level locks, for longer. Not good for
> concurrent access).

I wouldn't worry much about the performance and concurrency of this
operation. Remember that the majority of updates are expected to not
have to update the index, otherwise the minmax index will degenerate
quickly and performance will suck anyway. And even when updating the
index is needed, in most cases the new tuple fits on the same page,
after removing the old one. So the case where you have to insert a new
index tuple, remove old one (or mark it dead), and update the revmap to
point to the new tuple, is rare.

>> I'm quite surprised by the use of LockTuple on the index tuples. I
>> think the main reason for needing that is the fact that MMTuple
>> doesn't store the heap (range) block number that the tuple points
>> to: LockTuple is required to ensure that the tuple doesn't go away
>> while a scan is following a pointer from the revmap to it. If the
>> MMTuple contained the BlockNumber, a scan could check that and go
>> back to the revmap if it doesn't match. Alternatively, you could
>> keep the revmap page locked when you follow a pointer to the regular
>> index page.
>
> There's the intention that these accesses be kept as concurrent as
> possible; this is why we don't want to block the whole page. Locking
> individual TIDs is fine in this case (which is not in SELECT FOR UPDATE)
> because we can only lock a single tuple in any one index scan, so
> there's no unbounded growth of the lock table.
>
> I prefer not to have BlockNumbers in index tuples, because that would
> make them larger for not much gain. That data would mostly be
> redundant, and would be necessary only for vacuuming.

Don't underestimate the value of easier debugging. I wouldn't worry much
about shaving four bytes from the tuple, these indexes are tiny in any
case. Keep it simple at first, and optimize later if necessary.

In fact, I'd suggest just using normal IndexTuple instead of the custom
MMTuple struct, store the block number in t_tid and leave offset number
field of that unused. That wastes 2 more bytes per tuple, but that's
insignificant too. I feel that it probably would be worth it just to
keep thing simple, and you'd e.g. be able to use index_deform_tuple() as is.

- Heikki

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2014-07-29 12:26:57 Re: postgresql.auto.conf and reload
Previous Message Heikki Linnakangas 2014-07-29 10:30:12 Re: Production block comparison facility