Re: Minmax indexes

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Minmax indexes
Date: 2014-06-23 19:10:39
Message-ID: 53A87BAF.5020908@vmware.com
Lists: pgsql-hackers

On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>> With 8k block size, that's just enough to cover the full range of
>> 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each
>> regular revmap page can store about 8192/6 = 1365 item pointers,
>> each array revmap page can store about 8192/4 = 2048 block
>> references, and the size of the top array is 8192/4 = 2048. That's just
>> enough; to store the required number of array pages in the top
>> array, the array needs to be (2^32/1365)/2048 ≈ 1536 elements large.
>>
>> But with 4k or smaller blocks, it's not enough.
>
> Yeah. As I said elsewhere, actual useful values are likely to be close
> to the read-ahead setting of the underlying disk; by default that'd be
> 16 pages (128kB), but I think it's common advice to increase the kernel
> setting to improve performance.

My gut feeling is that it might well be best to set pages_per_range=1.
Even if you do the same amount of I/O, thanks to kernel read-ahead, you
might still avoid processing a lot of tuples. But we would need to see
some benchmarks to know.

> I don't think we need to prevent
> minmax indexes with pages_per_range=1, but I don't think we need to
> ensure that that setting works with the largest tables, either, because
> it doesn't make any sense to set it up like that.
>
> Also, while there are some recommendations to set up a system with
> larger page sizes (32kB), I have never seen any recommendation to set it
> lower. It wouldn't make sense to build a system that has very large
> tables and use a smaller page size.
>
> So in other words, yes, you're correct that the mechanism doesn't work
> in some cases (small page size and index configured for highest level of
> detail), but the conditions are such that I don't think it matters.
>
> ISTM the thing to do here is to do the math at index creation time, and
> if we find that we don't have enough space in the metapage for all array
> revmap pointers we need, bail out and require the index to be created
> with a larger pages_per_range setting.

Yeah, I agree that would be an acceptable way to handle it.

I still feel, though, that the approach quoted below would be simpler:

>> I wonder if it would be simpler to just always store the revmap
>> pages in the beginning of the index, before any other pages. Finding
>> the revmap page would then be just as easy as with a separate fork.
>> When the table/index is extended so that a new revmap page is
>> needed, move the existing page at that block out of the way. Locking
>> needs some consideration, but I think it would be feasible and
>> simpler than you have now.
>
> Moving index items around is not easy, because you'd have to adjust the
> revmap to rewrite the item pointers.

Hmm. Two alternative schemes come to mind:

1. Move each index tuple off the page individually, updating the revmap
while you do it, until the page is empty. Updating the revmap for a
single index tuple isn't difficult; you have to do it anyway when an
index tuple is replaced. (MMTuples don't contain a heap block number
ATM, but IMHO they should, see below)

2. Store the new block number of the page that you moved out of the way
in the revmap page, and leave the revmap pointers unchanged. The revmap
pointers can be updated later, lazily.

Both of those seem pretty straightforward.

>>> I have followed the suggestion by Amit to overwrite the index tuple when
>>> a new heap tuple is inserted, instead of creating a separate index
>>> tuple. This saves a lot of index bloat. This required a new entry
>>> point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new
>>> function PageIndexDeleteNoCompact which is similar in spirit to
>>> PageIndexMultiDelete except that item pointers do not change. This is
>>> necessary because the revmap stores item pointers, and such reference
>>> would break if we were to renumber items in index pages.
>>
>> ISTM that when the old tuple cannot be updated in-place, the new
>> index tuple is inserted with mm_doinsert(), but the old tuple is
>> never deleted.
>
> It's deleted by the next vacuum.

Ah, I see. Vacuum reads the whole index, and builds an in-memory hash
table that contains an ItemPointerData for every tuple in the index.
Doesn't that require a lot of memory, for a large index? That might be
acceptable - you ought to have plenty of RAM if you're pushing around
multi-terabyte tables - but it would nevertheless be nice to not have a
hard requirement for something as essential as vacuum.

In addition to the hash table, remove_deletable_tuples() pallocs an
array to hold an ItemPointer for every index tuple about to be removed.
A single palloc is limited to 1GB, so that will fail outright if there
are more than ~179 million dead index tuples. You're unlikely to hit
that in practice, but if you do, you'll never be able to vacuum the
index. So that's not very nice.

Wouldn't it be simpler to remove the old tuple atomically with inserting
the new tuple and updating the revmap? Or at least mark the old tuple as
deletable, so that vacuum can just delete it, without building the large
hash table to determine that it's deletable.

As it is, remove_deletable_tuples looks racy:

1. Vacuum begins, and remove_deletable_tuples performs the first pass
over the regular, non-revmap index pages, building the hash table of all
items in the index.

2. Another process inserts a new row to the heap, which causes a new
minmax tuple to be inserted and the revmap to be updated to point to the
new tuple.

3. Vacuum proceeds to scan the revmap. It will find the updated revmap
entry that points to the new index tuple. The new index tuple is not
found in the hash table, so it throws an error: "reverse map references
nonexistant (sic) index tuple".

I think to fix that you can just ignore tuples that are not found in the
hash table. (Although, as I said above, I think it would be simpler not
to leave behind any dead index tuples in the first place, and get rid of
the vacuum scans altogether.)

Regarding locking, I think it would be good to mention explicitly the
order that the pages must be locked if you need to lock multiple pages
at the same time, to avoid deadlock. Based on the "Locking
considerations" section in the README, I believe the order is that you
always lock the regular index page first, and then the revmap page.
There's no mention of the order of locking two regular or two revmap
pages, but I guess you never do that ATM.

I'm quite surprised by the use of LockTuple on the index tuples. I think
the main reason for needing that is the fact that MMTuple doesn't store
the heap (range) block number that the tuple points to: LockTuple is
required to ensure that the tuple doesn't go away while a scan is
following a pointer from the revmap to it. If the MMTuple contained the
BlockNumber, a scan could check that and go back to the revmap if it
doesn't match. Alternatively, you could keep the revmap page locked when
you follow a pointer to the regular index page.

The lack of a block number on index tuples also makes my idea of moving
tuples out of the way of extending the revmap much more difficult;
there's no way to find the revmap entry pointing to an index tuple,
short of scanning the whole revmap. And also on general robustness
grounds, and for debugging purposes, it would be nice to have the block
number there.

- Heikki
